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ABSTRACT OF THE DISSERTATION 


Reliable and Efficient Parallel Processing 
Algorithms and Architectures 
for Modern Signal Processing 

by 

KuoJuey Ray Liu 

Doctor of Philosophy in Electrical Engineering 
University of California, Los Angeles, 1990 

Professor Kung Yao, Chair 

Least-squares (LS) estimations and spectral decomposition algorithms constitute 
the heart of modern signal processing and communication problems. Implementa- 
tions of recursive LS and spectral decomposition algorithms onto parallel processing 
architectures such as systolic array with efficient fault-tolerant schemes are the major 
concerns of this dissertation. 

There are four major results in this dissertation. First, we propose the systolic 
block Householder transformation with application to the recursive least-squares min- 
imization. It is successfully implemented on a systolic array with a two-level pipelined 
implementation at the vector level as well as at the word level. 
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Second, a real-time algorithm-based concurrent error detection scheme based on 
the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, 
order degraded reconfiguration, and performance analysis are also considered. 

Third, the dynamic range, stability, error detection capability under finite-precision 
implementation, order degraded performance, and residual estimation under faulty 
situations for the QRD RLS systolic array are studied in details. 

Finally, we propose the use of multi-phase systolic algorithms for spectral de- 
composition based on the QR algorithm. Two systolic architectures, one based on 
triangular array and another based on rectangular array, are presented for the multi- 
phase operations with fault-tolerant considerations. Eigenvectors and singular vectors 
can be easily obtained by using the multi-phase operations. Performance issues are 
also considered. 
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Chapter 1 


Introduction 


In order to meet the demand of high throughput and high computational complexity 
of modern signal processing, parallel processing algorithms and architectures have 
been extensively studied and implemented for dedicated applications. Rapid ad- 
vances in VLSI microelectronics make it practical to build low-cost and high-density 
application-specific integrated circuits (ASIC) to meet the demands of speed and 
performance of modern signal processing. Recent VLSI/WSI technology permits the 
building of million of transistors in a single chip, while a large system may require 
hundreds of these chips to function properly. For a complex system, a single fault 
from any part of the system can make the whole system useless. For various critical 
applications, highly-reliable computations are demanded. Fault-tolerance is therefore 
needed in many of these problems. From these reasons, there is a potential prolifer- 
ation of activities in the new area of fault-tolerant signal processing which explores 
reliable ways to implement fast and efficient signal processing algorithms. The goal 
of this dissertation is to study the parallel processing algorithms and architectures as 
well as associated efficient fault-tolerant techniques for modern signal processing. 
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Least-squares (LS) problems and spectral decomposition algorithms constitute the 
heart of modern signal processing and communications applications such as adaptive 
filtering, spectral estimation, array signal processing, channel equalization, etc.. Im- 
plementations of recursive LS and spectral decomposition algorithms onto parallel 
processing architectures such as systolic array with efficient fault-tolerant schemes 
are the major concerns of this dissertation. 

Efficient implementation of the LS algorithm, particularly the recursive LS al- 
gorithm (RLS), is needed to meet the high throughput and speed requirements of 
modern signal processing. There are many possible approaches such as fast transver- 
sal method and lattice method which can perform RLS algorithm efficiently [6, 29]. 
Unfortunately, these methods can encounter numerical difficulties due to the accu- 
mulation of round-off errors under a finite-precision implementation as summarized 
in [18]. This may lead to a divergence of the computations of the RLS algorithm [18]. 
A new type of systolic algorithm based on the QR decomposition (QRD) known as 
the QRD RLS was first proposed by McWhirter in [72]. This algorithm is one of the 
most promising algorithms in that it is numerical stable [6, 54] as well as suitable for 
parallel processing implementation on a systolic array [29, 72]. 

Up to now, most of the QRD RLS implementations were based on the Givens 
rotation method and modified Gram-Schmidt method which both are rank-1 update 
approaches [19, 26, 31, 56, 67, 72, 43]. It is well-known that the Householder trans- 
formation (HT), which is a rank - k update approach, is one of the most computation- 
ally efficient methods to compute QRD. The error analysis carried out by Wilkinson 
[97, 39] showed that the HT outperforms the Givens method under finite precision 
computations. Presently, there is no known technique to implement the HT on a 
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systolic array parallel processing architecture, since there is a belief that non-local 
communications in the implementation are necessary due to the vector processing na- 
ture of the Householder transformation. One of the purposes of this paper is to show 
that we can implement the HT on a systolic array with only local connections. Thus, 
it is amenable to VLSI implementation and is applicable to real-time high throughput 
applications of modern signal processing. 

In Chapter 2, we first propose a systolic Householder algorithm called a systolic 
block Householder transformation (SBHT) to compute the QRD with an implemen- 
tation on a vectorized systolic array. Then a RLS algorithm based on the SBHT 
called SBHT RLS algorithm is proposed to perform RLS operations on the array. 
We shall show that the SBHT array and the SBHT RLS array are generalizations 
of Gentleman-Kung’s QRD array [26] and McWhirter’s QRD RLS systolic array [72] 
respectively. The difficulty in the applications of the above arrays is mainly due to 
the the vectorized operations of the processing cells. This results in a high cell com- 
plexity as well as a high I/O bandwidth. By using a modified HT algorithm proposed 
by Tsao [95], a two- level pipelined implementation of the SBHT RLS algorithm can 
be achieved. That is, the algorithm is pipelined at the vector level as well as at the 
word level. The complexity of the processing cell as well as the I/O bandwidth are 
thus reduced. In general, the cell complexity of the SBHT array is higher and the sys- 
tem latency is longer than that of the conventional Givens rotation implementations. 
With the two-level pipelined implementation, the throughput of the SBHT RLS sys- 
tolic array is as fast as that of McWhirter’s Givens rotation array, and it offers better 
numerical property than the Givens method. In addition, an extension of the SBHT 
RLS array to MVDR beamformation, which is a constrainted RLS problem, is also 
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considered. 


Fault-tolerance has been defined as the ability of a system to execute specified al- 
gorithms correctly regardless of hardware failures and program errors [?]. In order 
to achieve the goal of fault-tolerance, redundancy has to be introduced. When we 
encounter a specific VLSI signal processing problem, an inherent nature of that signal 
processing algorithm can be used to develop a highly efficient specific fault- tolerant 
technique named algorithm-based fault-tolerance. In Chapter 3, we propose a new 
algorithm-based fault-tolerant scheme derived from the inherent nature of the QRD 
LS systolic algorithm called the residual method. For a LS problem, especially in 
communication and signal processing applications, we abstract information from the 
residuals which are the differences of the optimal LS estimations from input data and 
the desired responses. Based on the fact that a QRD LS systolic array can compute 
the residuals of different desired responses simultaneously, an artificial desired re- 
sponse can be designed to detect any error produced by a faulty processor. We show 
that if the artificial desired response is designed as a non-zero linear combination of 
all data inputs, the residual output of this response will be zero if no fault occurred. 
Any fault in the system will cause the residual to be non-zero and thus the fault is 
detected in real-time. Thus, the residual method can be easily incorporated with the 
systolic array antenna beamforming systems such as that considered in [77] and will 
have a great impact on the next generation of radar and sonar systems where the 
fault-tolerance scheme is quite essential for reliable real-time operation. 

Once the fault has been detected, two methods to diagnose the location of the 
faulty row are addressed. The first method, called the flushing fault location method, 
is based on the weight flushing technique to flush the weight vector out by using an 
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identity matrix input. The second method is to use the checksum encoding property 
to detect the row which does not meet the checksum condition. When the faulty row 
is determined, this row and the column associated with the same boundary cell are 
eliminated by a reconfiguration operation. Then the system operates in an order- 
degraded manner which is acceptable in many least-squares applications. 

Though the QRD RLS algorithm is generally recognized as having good numerical 
properties such as numerical stability under finite- precision implementation [8, 60], 
there is no detailed study of this until a recent paper by Leung and Haykin [60]. 
Presently, it is still not known how to obtain the dynamic range of the algorithm in 
determining the word-length to ensure correct operation of the algorithm. 

In Chapter 4, we first observe that the cosine parameters generated by boundary 
cells will eventually reach the quasi steady-state if A is close to one which is gener- 
ally the case. We will show that the quasi steady-state and ensemble values of sine 
and cosine parameters are the same for all boundary cells. It is independent of the 
statistics of the input data sequence and the position of the boundary cell which gen- 
erates the sine and cosine parameters. Simulation results are presented to support 
this observation. These results yield the tools needed to further investigate many 
properties of the QRD LS systolic algorithm. Then, we can obtain upper bounds on 
the dynamic range of the processing cells. Thus, lower bounds on the memory size 
can be obtained from these upper bounds on the dynamic range to prevent overflow 
and to ensure correct operations of the QRD LS algorithm. With these results, we re- 
consider the stability problem under quantization effects with a more general analysis 
and obtain tighter bounds than given in previous work [60]. Two important factors 
of the fault-tolerant capability, the missing error detection and the false alarm effects 
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are also studied in this chapter. 

When the faulty row is found in the fault-tolerant QRD LS systolic array, we enter 
an order degraded operation. In Chapter 5, we study the performance degradation 
when the order of the LS is reduced and give a geometric interpretation. For a 
short-time transient fault, we do not switch to the order degraded operation. Thus 
we propose an approximate method to estimate the optimal residual in this faulty 
situation. 

Computing the spectral decomposition of a matrix is an important issue in many 
modern signal processing and system applications. The feasibility of real-time process- 
ing for sophisticated modern signal processing systems, depends crucially on efficient 
implementation of parallel processing of the algorithms and associated architectures 
needed to perform these operations [14, 58]. While many variations exist in the lit- 
eratures for solving these matrix problems, the heart of all these iterative methods 
are based either on the Jacobi-Hestennes method or the QR algorithm [30, 101, 107], 
While there are some fundamental differences between these two approaches, both al- 
gorithms have good numerical stability and convergence rate properties and thus are 
desirable for possible implementation. Since present VLSI technology is capable of 
building a multiprocessor system on a chip, many researchers have proposed different 
parallel processing architectures to solve eigenvalue and singular value decomposition 
(SVD) problems. 

Presently, there is no known simple efficient systolic array approach for the gener- 
ation of eigenvectors. The main reason is that there is no single architecture that is 
capable of handling all the steps required in the algorithm such that we can pipeline 
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the successive iterations readily. The communication cost among different architec- 
tures is high and the interface problem for an efficient data flow is demanding. In this 
paper, we propose two multi-phase systolic algorithms to solve the spectral decompo- 
sition problem based on the QR algorithm. By multi-phase operations we mean that 
the processing cells can perform different arithmetical operations in different phase 
of the computations. Two systolic arrays, one is triangular and the other is rectan- 
gular, are designed based on the multi-phase concept. A key feature in our method 
for the successfully application of the QR algorithm is that the Q matrix of the 
QR decomposition can be computed explicitly by multiphase operations. With the 
proper feedback of this Q matrix, the QR algorithm can be computed and pipelined 
effectively in a single systolic array. From the accumulation of those Q matrices in 
another array, eigenvectors and singular vectors can be computed without needing 
global communication inside the array. All these issues are considered in Chapter 6. 

Finally, Chapter 7 summarizes the research results of this dissertation and provides 
future research directions in these areas. 
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Chapter 2 


QRD RLS Algorithms Using 
Householder Transformation 

The QRD RLS algorithm is one of the most promising RLS algorithms, due to its 
robust numerical stability and suitability for VLSI implementation based on a sys- 
tolic array architecture. Up to now, among many techniques to implement the QR 
decomposition, only the Givens rotation and modified Gram-Schmidt methods have 
been successfully applied to the development of the QRD RLS systolic array. It is 
well-known that Householder transformation (HT) outperforms the Givens rotation 
method under finite precision computations. Presently, there is no known technique 
to implement the HT on a systolic array architecture. In this chapter, we propose 
a Systolic Block Householder Transformation (SBHT) approach, to implement the 
HT on a systolic array as well as its application to the RLS algorithm. Since the 
data is fetched in a block manner, vector operations are in general required for the 
vectorized array. However, by using a modified HT algorithm, a two-level pipelined 
implementation can be used to pipeline the SBHT systolic array both at the vector 
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and word levels. The throughput rate can be as fast as that of the Givens rotation 
method. Our approach makes the HT amenable for VLSI implementation as well 
as applicable to real-time high throughput applications of modern signal processing. 
The constrained RLS problem using the SBHT RLS systolic array is also considered 
in this chapter. 

In section 2.1, a brief review of the QRD RLS algorithm is given. In section 2.2, 
the SBHT is presented while the SBHT RLS algorithm is considered in section 2.3. 
The two-level pipelined implementation of the SBHT RLS systolic array is discussed 
in section 2.4. Finally, in section 2.5, the constrained RLS problem, as applied to 
MVDR beamformation, using an extension of the SBHT RLS array is presented. 


2.1 QRD Recursive Least-Squares Algorithm 

Consider a n x p real-valued data matrix ,4(n) denoted by 

M n ) = [n(l),lt(2), • • • ,Ii(n)] T = [«(l),a(2), • • • ,o(p)], (2.1) 

a n x 1 desired response vector y(n) = [d(l), d(2), ■ ■ ■ , d(n)] T , a p x 1 weight vector 
w{n ), and a n x 1 residual vector 

e(n) = (e(l), e(2), • • • , e(n)] T = A(n)w(n) - y(n). 

Let the index of performance be defined by the weighted l? norm of 

i{n) = \\e{n)\\l = ||Ae(n)|| 2 = e H (n)A 2 (n)e(n), 

where A(n) = diag[X^ n ~ l \ A* n-2 >, • • - , A, 1] with a real- valued forgetting factor 0 < 
A < 1. Then the LS solution, satisfies 

£m.nH = min ||i4(n)2w(n) - y{n)\\l = ||A(n)u;(n) - y(n) |ft. (2.2) 
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Assume A to be full rank with n > p. Then the QR Decomposition of A(n)A(n) 
yields 

t?(n)A(n).4(n) = [fl(n) T ,0] r , 

where R{n) is a p x p upper triangular matrix and Q(n) is an n x n unitary matrix. 
Thus 

£(n) = !|Q(n)e(n)|| 2 = ||[J2(n)uj(n) - £(n)|| 2 + ||u(n)|| 2 , 

where 

[£ T (n),v T (n)] T = Q(n)A(n)y(n), 

with a p x 1 vector P(n) and a n x 1 vector v(n). The LS solution tb(n) can be 
obtained from 

R{n)w{n) = P{n). (2.3) 

Then 4 mt „(n) = ||n(n)|| 2 . 

For some radar/sonar and communication problems, the weight vector w(n ) may 
not be of direct interest. For example, the residual is of interest in the multiple 
sidelobe canceller adaptive array problem. Let u{n) be the n th new incoming row 
vector of data in A(n) of (2.1). Then the LS residual 

e(n) = v* (n)w(n) — d(n), n = p,p + 1, • • • , (2-4) 

can be obtained without computing w(n ) explicitly. Thus, the complexity of the LS 
problem is further reduced and the solution of e(n) is feasible with a systolic array 
implementation [78]. 

To compute the linear LS problem recursively, a unitary matrix Q{n — 1) is defined 
as 
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■ Q(n-l) | O n _ x ‘ 

Q{n - 1) = 

Cr I 1 J 

and we have 

' \R(n — 1 ) ' 

Q(n — l)A(n)/i(n) = 0 

u T (n) 

Suppose the Givens rotation method is used for the QRD, then u T (n) can be anni- 
hilated by applying a sequence of Givens rotations 

G(n) = G p (n)...G 2 (n)G 1 (n) J (2.5) 
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where K(n ) is a pxp matrix, h(n ) is a p x 1 vector, I n - P -\ is the (n— p— 1) x (n— p— 1) 
identity matrix, and 7(71) is a scalar given by 7(71) = flLi c,(n), n > p with c,(n) is 
the cosine parameter associated with the i th Given rotation. The error vector can be 
transformed by Q(n — 1) to 



' \R(n — 1) ' 


‘ A£(n-1) ' 

Q{n - l)Ae(n) = 

0 


Xv(n — 1) 


u H {n) 


d(n) 


When the Given rotation is applied on it, we get 


G(n)Q(n — l)Ae(n) = 

' R(n) ' 

w(n) — 

' Lin) ' 


0 


u( n ) 


It can be easily seen that the last element of t7(n), u n (7r), can be obtained naturally 
during the triangularization process [78] and is given by 

v n (ri) = \h H (n)P(n — 1) + ^(71)7(77). 

In [78], McWhirter has shown the LS residual e(n) can be expressed as 

e(n) = u„(n)7(n). 

The above results lead to the systolic implementation of QRD RLS algorithm with- 
out computing weight vector explicitly [78]. The systolic array is shown in Fig. 2.1. 
It consists of two parts: a triangular array for computing QRD and a linear column 
array called response array (RA) for computing LS residual. When a new data vector 
is updated, a new triangular matrix R sits in the triangular QR array and a new 
vector P sits in the RA. 
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(1) Boundary Cell 



(2) Internal Cell 

%xn 


1 



t 


%out 


1^ %in — 0 then 

c 1; s 0; y out 

r = A r, 

otherwise 

T' = v/AV+4; 

c-Ar/r'; 

r ^ r '\ 7ou< = C7,„ 

end 


^out * ci iri sAr 
r «- sx tn + cAr 


(3) Final Cell 



Figure 2.2: Processing Cells of QRD RLS systolic array 




2.2 Systolic Block Householder Transformation 


The Givens rotation method discussed above is a rank-1 update approach since each 
input is a row of data. For the systolic block Householder transformation (SBHT), 
we need a block data formulation. Denote the data matrix as 


X(n) = 

and the desired response vector as 


" 1 


' X(n-l) 1 

hi e 
... * 

i 


x ! 
1 

1 


e ft 


nkxp 


( 2 . 6 ) 


y( n ) = 

where X.J is the i th data block, 


T 

X (i — 1 )fc+2 


' yi ' 
y2 


' y(n-l) ' 

. y« . 


y» 


€ ft"*, 


= [X.,1 X.,2 • • • X l p ] 


^(i-ljfc + l.l — 1 )fc+l .2 

X(i-l)k+2,l 2:{,_i)t + 2,2 

Z|fc,l X,k, 2 

and y, is the i ih desired response block. 


y. = 


V(i- i)fe+i 
V(i-l)k+2 

L yik 


X[i-\)k+\,p 

X(i-\)k+2,p 


’ ' ' Xi k,p 


€ ft*. 


€ ft 


kxp 


k is the block size and p is the order (columns) of the system. 
For a rank-/: update QR decomposition, suppose we have 


(2.7) 


( 2 . 8 ) 


(2.9) 


( 2 . 10 ) 
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( 2 . 11 ) 

(2.12) 

(2.13) 

(2.14) 

(2.15) 

(2.16) 


where z € 3f? n . When a vector x is multiplied by T, it is reflected in the hyperplane 
defined by span{ z} x . Choosing z = x ± Hx^ej, where e! = [1, 0, 0, ■ • • , 0] € $R n , then 
x is reflected onto ej by T as 

Tx = ±||x|| 2ei . (2.17) 

That is, all of the energy of x is reflected onto unit vector e! after the transformation. 
We can zero out by applying successive Householder transformations as follows, 
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R (-^(n - 1) 



R<'>(n- 1) 


H (,) (n) 

0 

= 


0 



0 * * • 0 x^ 1- *^ • • • x^ ,— ^ 

L f > y Jv n,p J 


. o,- 

• 0 0 v^. 

• , u, u, x n l+1 , 

x (0 

» *Ni,p J 


for i = 1, • • • ,p, where x^ 0 ,- = x n> ,, R^(n — 1) = R(n — 1), and the resultant matrix 
H(n) is 

H(n) = H (p) (n)H (p ~ ,) (n) • • • H (1) (n), (2.18) 


where each H*'*(n) represents a Householder transformation which zeros out the i th 
column of the updated X^, i.e., x^'J 1 ^. 

To obtain H^(n), denote 


Zl 


r u - <7j 


Opi-ijjt-i 


L n,l 


where r n is the (1,1) element of R(n — 1), <j\ = + ||x nil || 2 . Then from (2.16) 

hn{n) : 0 T 1 h ( 12 )T (n) 


H (1) (n) = 


: i : 0 


L h<V(n) : 0 


I H^(n) 


(2.19) 


where /ijj'(??) is a scalar, hj 2 *(n) is a k x 1 vector, h 2 V(rc) = hj^(n), and H 22 '(n) i 
a k x k matrix and 

2x n.l X n,l 


IS 


H^(n) = l k - 


(2.20) 


Z) 


with = (|zi|| 2 = 2(<7j — crjrn). Define 0i = af — <7ir n , (2.20) can be rewritten i 
a form without multiplication of the number 2 as 


in 


H (1)_T X n, 1 X n,l 

n 22 — 1 — 

01 


17 



In general, 



0 



H (m >(n) = 


0 : I(„_i)jt- P : 0 


(2.21) 


[ HS;'(n) i 0 i 

where Hi7'(n) € 9i pxp is an identity matrix except for the m th diagonal entry; 
6 5f? px * is a zero matrix except for the m th row; H 2 7H n ) = ^[7^ T ( n ); and 


H& ) (n) = I fc - 


x (m-l) x (m-l)T 
A " *n ^n.m 

0m 




kxk 


( 2 . 22 ) 


is symmetric with 0 m = o 2 m - cr m r mTn , where <r 2 m - r 2 mm + ||x^ ^U 2 . It can be easily 
seen that H7 2 («)H 2 J ^(n) = 0, H 2 'i (n)Hj 2 ^(n) = 0, for Vi ^ j. Thus we have the 
following lemma, 


Lemma 2.1 The Householder transformation matrix, H(n) G 3 ? nkxnk ) ts orthogonal 
and is of the form 


H(n) = 


Hn(n) : 0 : H 12 (n) 


0 : I( n — i)Jt— p 0 


L H 2 i(n) : 0 : H 22 (rc) 


with 


H (p) (n)H (p " 1 H«)---H (1) (n), 

(2.23) 


Hn(n) = 

H 22 (n) = (2.24) 
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If the block size k = 1, due to the fact that Givens rotation method is a special case of 
rank-1 update Householder transformation [30], the H matrix in Lemma 2.1 becomes 
a Givens rotation matrix G of the form [78] 


G(n) 


K(n) 1 0 | h(n) 

0 | I n _p_i | 0 

h w (n) | 0 | 7(n) 


where K(n) is a p x p matrix, h(n) is a p x 1 vector, and 7(71) is a scalar given by 
7 (n) = n? = i c,(n), n > p where Ci(n) is the cosine parameter associated with the i th 
Given rotation. 


2.2.1 Vectorized SBHT QRD Systolic Array 


Now we propose a vectorized systolic array to implement the QRD based on the 
SBHT. Similar to the QR triarray of Gentleman-Kung [29], this array has both 
boundary and internal cells. The boundary cell takes an input of block size k from the 
previous processor or directly from the input port, updates its content and generates 
reflection vector, and sends it to the right for the internal cell processing. Define 


>('-D r _ 


‘n, t 


0 ,. 


0(n-l)fc-t 


(«-l ) 7 

* 71,1 


, * = 1, • • • ,p, 


and z, = xj,'; 11 — a,e,, where e, is a zero vector except for a unity at the i th position. 
When an internal cell receives the reflection vector, instead of forming the matrix 
z,zf and performing matrix arithmetics, it performs an inner product operation to 
update its content r tt by doing 


H<->(n)x!;- ,) = *£j ,) - |(z J ■ <-*'), j = r + 1, . 


>P, 


(2.25) 
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and sends the reflected data x n ,j downward for further processing. Fig. 2.3 and 
Fig. 2.4 show the SBHT QRD array architecture and the operations of the processing 
cells respectively. When the block size is k = 1, this vectorized array degenerates to 
the Gentleman-Kung’s Givens rotation triarray. 


2.3 SBHT RLS Algorithm 


The LS problem is to choose a weight vector w(n) € 3? p , such that the block-forgetting 
norm of 


e(n) = 


e,(n) 

e 2 (n) 

L e„(n) J 


= X(n)w(n) - y(n) 


is minimized. That is, 


w(n) = argmin ||e(n)|| Afc , 


where 


W")IIa, = II A,(n )e(n)|| = 


^ • ||e,(n)i, 


Ak{ n ) is a block-diagonal exponential weighting matrix of the form, 


Afc(n) = 


r A n-1 It 0 0 


0 ••• AI k 0 

0 0 I, j 




nkxnk 


and || • || 2 is the Euclidean norm, 


(2.26) 


(2.27) 


0 < A < 1, (2.28) 


(2.29) 


I e .(”)il 2 = Hk(,-i)fc +J (n)| 2 . 
;=i 


(2.30) 


The exponential forgetting weighting A is incorporated in the RLS filtering scheme to 
avoid overflow in the processors as well as to facilitate nonstationary data updating. 
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X 



If ||x|| 2 = 0 then 
a = 0, r = Ar, 
else 

5 - AV + llxlll 

s <— y/s 

o * — S + sAr 
Ar -f s 
x 

r *— $ 
end 


x 



y 


Figure 2.4: The processing cells of the SBHT QRD systolic array. 
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If we know the QRD of the weighted augmented data matrix at time n (in the 


block sense, which is equivalent to the nk snapshots), which is given by 


where 


Afc(n) X(n) i y(n)j = (Qf(n)iQ[(n)] 


R(n) 

0 


u( n ) 

v(n) 


Q(n) 


Qi(») ' 

Q i(n) . 


(2.31) 


and Qi(n) € 3ft pXnfc and Q 2 (n) € 3f J( n *-p)* nfc constitute an orthogonal transforma- 
tion matrix with the former spanning the column space of the weighted data matrix 
A k(n)X(n) and the latter the null space, and R(n) G 5? pxp is an upper triangular 
matrix. The optimal weight vector can be obtained by solving 


R(n)w(n) = u(n). (2.32) 

Obviously, A/t(n)X(n) = Q^(n)R(n). As a result, the weighted optimal residual of 
(2.26) is, 

A t (n)<(n) = Qf(n)R(n)w(n) - Qf(n)u(n) - Q^(n)v(n) 

= -Ctf(n)v(n), (2.33) 

which lies in the null space of the weighted data matrix. 

Now, suppose we have the data matrix up to time n — 1 and the QRD of A k (n — 
l)[X(n — 1) : y(n — 1)], then the recursive LS problem is to efficiently compute the 
optimum residual at time n from the results we have at time n — 1. In particular, we 
are interested in the new n th block of the optimal residual, 

e n (n) = X^w(n) - y n . (2.34) 
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From (2.13), (2.14) and (2.23), (2.31) can be expressed as 





’ H„(n) 

: 0 

i H„(n)‘ 


AR(n — 1) 

Au(n — 1) 

R( n ) 

0 

u ( n ) 

v(n) 

= 

0 

• I(n-l)fc-p 

: 0 

• 

0 

Av(n — 1) 




. H a ,(n) 

: 0 

: H 22 (n) . 


. X n r 

y n 


By recursion on n, we relate Q(n) and Q(n — 1) using (2.15) and have 


Q(n) = 


Hn(n) 0 : Hi 2 (n) 

0 : I(n— ijfc— p : 0 

H 2 i(n) : 0 : H 22 (n) 

Hn(n)Q 1 (n - 1) : H 12 (n) 



Qx(n-l) 

0 


Q 2 (n-1) : 

0 


. o 

h . 


Q 2 (n - 1) 


0 


H 2 i(n)Q,(n — 1) : H 22 

We can see that Q 2 (n) is updated from Qi(n — 1) and Q 2 (n — 1) by 


Q 2 (?i) = 


Q 2 (n - 1) 


0 


H 21 (n)Qi(n - 1) : H 22 

On the other hand, the updated (u T (n), v r (n)] T is 


u (n) 
v(n) 


= Q(n)A*(n) 


y(« - 1) 
y» 


AHn(n)u(n - 1) + H 12 (n)y„ 
Av(n — 1) 


where 


(2.35) 


(2.36) 


(2.37) 


v n = AH 2 i(n)u(n - 1) + H 22 (n)y„. 


(2.38) 
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Therefore, from (2.33), (2.36), and (2.37), the weighted optimal residual vector can 
be obtained from parameters in time n — 1 by 


' |(n - 1| n) 1 T -AQ^(n - l)v(n - 1) - Qf(n - l)H£(n)v n ' 

Afc(n)£(n) = = , 

e n (n ) J [ -H£,(n)v n 

(2.39) 

where £(m|n) denotes the estimate of i at time m, m < n, given all of the data up 
to time n. The new n th block of the optimal residual is then obtained as 

e„(n) = — H^ 2 (n)v n = -H ( 2 1 2 ) (n)Hg ) (n) • • • H$(n)v n . (2.40) 

For the block size of k = 1, all vector parameters in (2.40) become scalars and can 
be expressed as 

e„(n) = - J] c,u n , (2.41) 

i=i 

which was first shown by McWhirter in [78]. Note that there is little difference in the 
optimal residuals estimated by the SBHT and the Givens rotation methods. To be 
specific, the optimal residual vector in (2.40) is given by 

e(„_ 1)it+1 ((n - 1 )k + l|nfc) 

e„(n) = ‘ , (2.42) 

e„*_i(nfc - l|nfc) 

e n k{nk\nk) 

while the optimal residual estimated by the Givens rotation method in (2.41) is 


e n (rc) = e„(n|n). (2.43) 

In this sense, the SBHT RLS gives a better estimate of the residual since it uses 
more data samples to estimate the optimal residual. Take k = 2 as an example; 
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the optimal residual obtained from the SBHT RLS and the Givens methods axe 
[e 2 n-i(( 2 n — l)|2n),e 2 „(2n|2n)] and [e 2n -i((2n — l)|(2n — 1), e 2n (2n|2n)] respectively. 
It is clear now that the SBHT RLS method gives a better estimate for the previous 
residual than the Givens rotation method because the former makes use of the future 
data sample at time 2 n to estimate the residual at time 2n — 1, while the latter does 
not. 

2.3.1 Vectorized SBHT RLS Array 

To obtain the residual vector for RLS filtering in the systolic array, there are two 
possible approaches. The first one is to generalize the architecture of McWhirter’s 
Givens rotation approach [78], A SBHT QRD array is incorporated with a RA as 
shown in Fig. 2.5. Observe that v n in (2.38) results from the reflection computation in 
(2.37), therefore v„ is obtained naturally from the output of the RA. Each boundary 
cell then forms the matrix H 22 and propagates it down diagonal boundary cells. Since 
Hj'jj is generated earlier than for i < j, equation (2.30) has to be computed from 
left to right. Obviously, we prefer to compute (2.40) from right to left such that only 
inner product computations are performed instead of matrix multiplications. As a 
result, each boundary cell performs the matrix multiplication to cumulate H 22 when 
it is propagated down diagonal boundary cells. The matrix multiplications needed in 
the boundary cells in this approach are objectionable since they not only slow down 
the throughput but also increase the complexity of the boundary cells. We note, 
McWhirter’s original approach based on Givens rotation worked well since only scalars 
need to be propagated down the diagonal boundary cells and the multiplications for 
scalars can be in the reverse order. 
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Residual 

vector 


Figure 2.5: The SBHT RLS systolic array obtained by direct generalization of the 
Givens rotation array. 
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Instead of forming the matrix H^j and propagating it down, another approach 
is to use the facts that can be expressed by using (2.22) and the reflection 
vectors are sent to the right from boundary cells as described in Section 2.2.1. From 
these observations, (2.40) can be computed in similar operations performed by the 
internal cells. A generalized architecture, as shown in Fig. 2.6, is thus introduced to 
circumvent this problem. A column array of internal cells called backward propagation 
array (BPA) is added at the right hand side to perform the backward propagation of 
v„. Each row, say the i th one, needs 2(p — i) delayed buffers as shown in Fig. 2.6. 
The v„ obtained at the output of RA is then backward propagated through the BPA. 
From (2.22), each cell of this array performs the operation 


Hg(n)v' n 




<), *=?,••■, 2 , 1 , 


(2.44) 


where v' n is an updated v n . This is a subset of the operations performed by the 
internal cell shown in (2.25). The residual vector is obtained from the top of the 
newly added column array. 

The costs for this proposed architecture are: an increased latency time from (2 p + 
1 )t s of McWhirter's Givens method to 3 pt v , w'here t 3 represents the processing time 
for the scalar operations used in the Givens rotation method and t v is the processing 
time for vector operations used in the SBHT method; the number of delay elements 
needed increases from p to 2 (p — i) = p(p— 1); and p additional internal processing 

cells. The operations of the boundary and internal cells are given in Fig. 2.5. These 
results clearly show that HT can be implemented simply on a systolic array to achieve 
massive parallel processing with vector operations. This provides an efficient method 
to obtain a high throughput rate for recursive LS filtering by using the HT method. 
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Figure 2.6: The SBHT RLS systolic array. 



Backward Propagation 
Array 








2.4 Two-level Pipelined Implementations 

The SBHT QRD array and RLS array discussed in the above sections are derived 
from the conventional Householder transformation shown in (2.16). Due to the vector 
processing nature of the conventional method, the cells of both arrays perform vector 
operations such as inner products. This means the complexity of each cell is high 
and the I/O bandwidth is large in order to have an effective communication of vector 
data. Each cell, due to the complexity for vector processing, may lead to a large 
processor. Clearly, this is not a desirable property for VLSI implementation. Thus, 
we are motivated to find a suitable algorithm to pipeline the data down to the word 
level such that the I/O bandwidth as well as the complexity can be reduced. In 
addition, we still wish to achieve a high throughput which is needed in many modern 
signal processing applications. 

The conventional approach in computing Householder transformation, y = Tq, 
based on (2.16) is to form z and ||z|| 2 from x first and then z r q/||z|| 2 and q — 
2z(z T q/||z|| 2 ) as considered before. It can be stated in the following form: 

HT Algorithm (Conventional) 

Step 1. S IX = j | x 1 1 2 . 

Step 2. If S xx = 0, then y = q. 

Step 3. If S XI ^ 0 then 

(1) s = V^T, z = x + [s,0,0,- • • ,0] T , 

(2) <f> = S xx + sx,, S zq = z T q, 

(3) d- S :] /<f>, y = q - dz. 

In [104], Tsao pointed out that by skipping the computation of ^ and avoiding 
the cumbersome intermediate steps of forming vector z for further computations, a 
modified algorithm with smaller round off error and less operations can be obtained. 
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Only step 3 of the conventional algorithm is to be modified as follows, 

Modified HT Algorithm (Tsao, 1975) 

Step 3. If S xx ^ 0 then 

(1) & ~ "h 

(2) S xq = x T q, 

(3) 3/i = -S xq /s, d={ qi - yi )/(T, y { = q { - dii, i = 2,- ■ ■ ,n. 

With this algorithm, the operations of the cells of the vectorized systolic arrays can be 
modified as shown in Fig. 2.7. As we can see that, for the boundary cell, the vector u, 
which consists of the forgotten diagonal element of the upper-triangular matrix and 
one column of the input data block (updated or not), can be sent out immediately 
when the input x is available without waiting for any computations as required in 
the implementation using the conventional algorithm. Due to this advantage and the 
modified operations in the internal cells, we can then pipeline the vector operations 
down to the word level such that each cell only performs scalar operations which will 
significantly reduce the complexity of the cell. 

A two-level pipelined implementation of the modified HT algorithm is given in 
Fig. 2.8 and Fig. 2.9. The boundary cell performs three major functions - square- 
and-cumulate, square root, and addition. For each data block, the boundary cell 
fetches one data sample, cumulates the square of the sample, and sends the data to 
the right for internal cell processing. When all the data of the block are processed, 
the content of S is then sent down for square-root operation. The resultant s is sent 
to the right for internal cell processing as well as sent down to obtain cr, which is then 
sent to the right when available. At the same time, when an internal cell receives a 
Uj, it multiplies u, with an input x, and cumulates all these products to obtain S. 
When 5 is available, it is sent down for division operation with s, which arrives at 
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( ( 


If ||x|| 2 = 0 then 



r «— s 
end 


If o = 0 then 
y = x. r as A r, 



Figure 2.7: Operations of the processing cells by using modified Householder trans- 
formation. 
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Figure 2.8: Processing cells for two-level pipelined implementation. 
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• • • . 




sendj-ight Ar 

5 - A 2 r 2 
do t = 1, k 

fetch x,; 5 «- 5 + x 2 ; sendjright x,, 
end do 

r' ♦_ Ar; s *- y/S\ send_right s, 
r «_ 5; a <— r' + 5; sendjright a. 


5 = 0. 

d o ? = 1 , A’ + 1 
fetch u,. 

if 2 = 1 then 

5 *— 5 -t- u,Ar; send_right u,. 
else 

. fetch x,_j; 5 = 5 + send_right u t . 

end if 

fetch r' <— Ar; f ♦ S/s , 

fetch o: r *— t\ d *— {r 1 — i)/a. 
end do 
do ? = 1, A- 

y, = x, — send.down y, . 

end do 

.v* 

to 

.Vi 


■T] 



*3 

-T| 



Figure 2.9: Operations of the processing cells 
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the same time, to obtain t; then t is sent down and a again arrives at the same time 
to compute d. To compute y, of (2.8) in Step 3, we need some registers to store u,- 
and Xi temporarily. Since data from the next block is continuously being sent into 
the system, each internal cell needs 2 (k + 3) registers to store u, and as indicated 
in Fig. 2.8. When d is available, yi is then obtained one by one and sent down for 
further processing. For data from the next block, it goes through the same processing. 
When a new d is available in the internal cell, the corresponding Xi and it, are already 
waiting in the registers. Therefore the vector operations are successfully pipelined 
down to the word level. This means that by using the modified HT algorithm, we 
have not only pipelined the SBHT arrays at the vector level but also at the word 
level. The input data is now skewed in the word level as shown in Fig. 2.8 rather 
than in the vector level as shown in Fig. 2.6. 

Since the most time consuming operation of this two-level pipelined implementa- 
tion is the square root operation which is also the critical operation in McWhirter’s 
Givens rotation implementation, the throughput of this two-level pipelined imple- 
mentation is as fast as that of the McWhirter’s Givens array. However, with a longer 
pipeline, a longer system latency for the SBHT method is required. This is due to the 
fact that the registers of the internal cells have to be all filled before we can obtain 
the residual vector. For the SBHT RLS systolic array of order p, we have (p 2 + 3p)/2 
internal cells, including the BPA. Thus, there are a total of (p 2 + 3 p)(k + 3) regis- 
ters for the whole system. The system latency is given by t, = 2 p(k + 4), which 
is linearly proportional to p and k. However, for the Givens rotation method, the 
system latency is only t, = 2p + 1. Comparisons of both RLS arrays based on the 
SBHT and Givens rotation are summarized in Table 2.1. In general, the throughput 
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Givens rotation 

SBHT 

Number of cells 

(p^3p)/2 

(p2+5p)/2 

Number of delay elements 

P 

P(P-1) 

Number of registers 

0 

(p?+3p)(k+3) 

System latency 

2p+l 

2p(k+4) 

Cell complexity 

less 

higher 

Numerical stability 

good 

better 


Table 2.1: Comparisons of the SBHT and Givens rotation methods 

of the SBHT RLS systolic array is as fast as Givens rotation method. Of course, 
while the cell complexity of the SBHT array is higher, it does offer better numerical 
property [107]. A detailed backward error analysis carried out by Wilkinson showed 
that for an n x n matrix A , after n(n — l)/2 Givens rotations, the roundoff error in 
the upper-triangular matrix is in the order of 0(K g n 3 ^ 2 fi\\A\\) [107, pp.138], while a 
series of n — 1 HT’s gives about 0(K^n/i||A||) [107, pp. 1 60] , with K g and K/, being 
constants and [i a floating machine computation precision constant. 

2.5 Constrained RLS Problems 

In the above sections, we have dealt with an unconstrained RLS problem. The RLS 
systolic array considered there was motivated originally by the sidelobe canceller 
beamformation problem [78]. Other practical motivation could have come from the 
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adaptive filtering problem [32]. However, there are other signal processing applica- 
tions which are modeled by a constrained RLS problem. The MVDR beamforming 
constitutes such an example [80, 85, 97]. It is interesting to determine whether a sys- 
tolic array for an unconstrained RLS problem can also be used for a constrained RLS 
problem. In [80], McWhirter and Shepherd showed an extension of the unconstrained 
RLS array to the MVDR beamforming problem. Based on their approach, we shall 
also demonstrate the implementation of a MVDR beamformation problem using a 
SBHT RLS array. 

The MVDR beamforming problem is to minimize 

= ||X(n)w('»(n)|| Al £=1, (2.45) 

subject to the linear constraints of 

c ( ' ) 7 V')(n) = /?('\ £= 1 , •••,!, (2.46) 


where L is the number of constraints. We are interested in the a posteriori residual 


vector 


4'>(n) = X>W(n)- (2-47) 

The optimal solution of the weight vector is known [80] to be given by 


w (<) (n) — 


/?^M J (n)c^' /?^R 1 (n)a^^(n) 


(2.48) 


cWTM-i(n)c(') ||a<'>(n)j| 2 

where M = X T (n )A l( n)X(n) is the weighted covariance matrix, R(n) is the upper 
triangular matrix resulted from the QRD of the weighted data matrix A* X(n), and 

a ( ^(n) = R- T (n)c w . (2.49) 


Therefore the optimal residual vector at time n is 

*LV>- ^ 


||aW( n ) 


• X^R -1 (n)a^(n). 


(2.50) 
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A crucial step needed is for the efficient recursive updating of a^(n). A novel ap- 
proach was proposed for performing this updating [80]. Specifically, from (2.13), 
(2.14), and (2.49), 

c ( ^ = R r (n — l)a^'(n — 1) 

Aa^(n — 1) 

= A- 2 [ ARr(n-l) ; 0 r ; X n ] AbW(n-l) , (2.51) 

0 

where b^(n — 1) is an arbitrary ((n — 1 )k — p) x 1 vector. Then from Lemma 2.1, 
(2.13), and (2.14), we have 

Aa (/) (n — 1) 

C ( '> = A -2 [ AR r (n — 1) : 0 r ; X„ ] H T (n)H(n) AbW(n - 1) 

0 

= R T (n)T- 2 (AH„(n)aW(n-l)). (2.52) 

Thus, a^(n) = A -2 (AHn(n)a^(n — 1)) can be obtained by updating a^*(n — 1) in 

a way similar to that u(n) is obtained by updating u(n — 1) using (2.37). The only 
differences are the input for updating a*^(n — 1) is a zero vector and a scaling factor 
A -2 . Due to the structure of H in Lemma 1 , the vector b ( ^(-) plays no role in the 
updating of a^(-). Furthermore, from (2.32) and (2.34), we have 

e n (n) = X^R- 1 (n)u(n)-y„. (2.53) 

From (2.37), we see that u(n) results from the update of [y(n — 1) : y n ], where y n is 
the new input. Now replacing u(n) with a ( ^(n) and y n with a zero vector, we have 

e n (n) = X^R' 1 (n)a^(n), (2.54) 
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and from (2.50), we then obtain 



/?(') 

l|a ( '>(n)ll J ' en(n) 


(2.55) 


This equation reveals that by the proper scaling of e n (n), which can be obtained 
from the SBHT RLS systolic array, we can obtain the a posteriori residual vector, 
e^(n), of the MVDR beamformation. Fig. 2.10 shows an extension of the SBHT 
RLS array for the new problem. Now one more data channel is needed for the RA to 
pipeline cumulation of ||a^l(n)|| 2 , and the scaling of the residual vector is done at the 
bottom of the RA when a new ||a^(-)j| 2 is available. Each RA/BPA pair in Fig. 2.10 
represents one of the K constraints. The optimal a posteriori residual vector of each 
linear constraint is obtained at the output of the corresponding backward propagation 
array. 

As pointed out in [80], there are two ways to initialize the array. One method is to 
set R(0) = 61, where 8 is a small scalar, and thus from (2.49), a^(0) = <5 -1 c^, t = 
1, • • • , L. Another method is to obtain R(n) to some time n, then use (2.49) to obtain 
aP)(n). The details of a two- mode operation required for this initialization procedure 
are also considered in [80]. 
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Figure 2.10: The MVDR beamforming systolic array 








Chapter 3 


Real-Time Algorithm-Based 
Fault-Tolerance for QRD RLS 
Systolic Array 


In this chapter, we propose a new algorithm-based fault-tolerant method derived from 
the inherent nature of the QR least-squares systolic algorithm. Since the residuals 
of different desired responses can be computed simultaneously, an artificial desired 
response can be designed to detect an error produced by a faulty processor. We 
show that if the artificial desired response is designed as some proper combinations 
of the input data, the output residual of the system will be zero if there is no fault. 
However, any occurring fault in the system will cause the residual to be non-zero and 
the fault can be detected in real-time. Once the fault has been detected, the system 
enters into the fault diagnosis phase from the concurrent error detection phase. Two 
methods, the flushing fault location and the checksum encoding methods, can be 
used to diagnose the location of the faulty row. When the faulty row is determined, 
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this row and the associated column with the same boundary cell are eliminated by a 
reconfiguration operation. Then the system degrades in a graceful manner which is 
generally acceptable for many least-squares applications. Those eliminated processors 
enter into a self-checking phase, and when the transient fault condition is removed, a 
reconfiguration is performed to resume the normal full order operation. The analysis 
of error propagation and recovery latency is also considered in this chapter. 

This chapter is organized in the following manner. In section 3.1, we review 
some previous works in algorithm-based fault-tolerance and their advantages and 
disadvantages. In section 3.2, we address the fault model. Various fault-tolerance 
schemes, including concurrent error detection, fault diagnosis, and order degraded 
reconfiguration are presented in section 3.3. In section 3.4, we prove the residual 
method is robust and discuss some important latencies. Finally, in section 3.5, we 
present some conservation tests which may also be used to detect faults. 

3.1 Algorithm-Based Fault-Tolerance 

For a given algorithm, some inherent nature of that algorithm can be used to de- 
velop a highly efficient specific fault- tolerant technique denoted as algorithm-based 
fault-tolerance. The term algorithm-based means it is an algorithm-specific and not 
a general scheme that can be applied to all general problems. An inherent property 
of many matrix computations is that checksums preserved after the computations. 

j 

This observation has led to the discovery of algorithm-based fault-tolerance which 
was first studied at University of Illinois at Urbana. A recently reported algorithm- 
based fault- tolerant technique, called checksum (and weighted checksum) encoding 
scheme proposed by Hwang, Abraham, Jou, Chen et al., has evolved from the study 
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of VLSI matrix computations [3, 4, 5, 6], This scheme belongs to the category of infor- 
mation redundancy [21]. Since few hardware and time redundancies are necessary, it 
is promising for its low-cost and low overhead for VLSI/WSI multiprocessor systems. 
Many applications of the checksum (or weighted checksum) scheme have been suc- 
cessfully applied to various signal processing and linear algebra operations [13]. The 
major drawback of the checksum scheme proposed in [6] is that the system through- 
put will be slowed down because the system clock has to be extended long enough 
to accommodate the longer signal path of the non-local interconnection caused by 
the checksum scheme. Unfortunately, local connection is one of the basic desirable 
requirements of implementations. Define the checksum vector 

e T = [l,l, •••,!] 

and weighted checksum vector 

el = [wi,t02, •••,«»„], 

the column, row and full weighted checksum matrices A cw ,A rw , and Aj w of a square 
n-by-n matrix A as 


A 


cw 




Ae Ae w j , 

At Ae w 
e T .4e e T Ae w 
elAe e^Ae w 


(3.56) 


(3.57) 


(3.58) 
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Figure 3.11: Matrix-matrix multiplication and LU decomposition with (weighted) 
checksum encoding 

If there is any fault occurring during the computation, the checksum criterion is 
not met and thus the fault is detected. The checksum and weighted checksum 
fault-tolerance schemes have been proposed by Huang, Jou, Chen, and Abraham 
[14, 15, 33, 41] to various signal processing and linear algebra operations such as ma- 
trix multiplication and inversion, QR decomposition, LU decomposition and singular 
value decomposition (SVD) et ai. For examples, Fig. 3.11 shows the applications 
of checksum scheme to matrix-matrix multiplication and LU decomposition. The 
weighted checksum scheme can be further used to correct errors [41, 2]. The num- 
ber of error to be corrected depends on how many weighted checksum vectors are 
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used. The major advantages of (weighted) checksum scheme are low-cost and low 
overhead for error detection and correction of multiple processor systems. However, 
the disadvantages are: 

1. There is the problem to distinguish roundoff errors from those caused by failures 

[47]; 

2. If the intermediate results of each processor are continuously propagated to 
other processors such as QRD LS systolic array, the system throughput will 
be slowed down because the system clock has to be extended long enough to 
accomodate the longer signal path of the non-local interconnections caused by 
the checksum scheme. 

There are some difficulties in applying checksum schemes to system problems such 
as communication data equalization, and radar/sonar adaptive antenna array, where 
the signal arrives continuously and high throughput rate is required. As an example, 
for a recursive QRD LS systolic array, the data is coming in row by row. As pointed 
out in [17], the QRD of a row checksum encoding matrix A r results in a row checksum 
encoding upper triangular matrix Rr. That is, A T — QR r . If the checksum scheme 
is used to detect error, during the recursive operation, when a new upper triangular 
matrix R is formed, to obtain a new checksum for each row, we need to sum up each 
elements of the corresponding row. Suppose only local connection, which is one of the 
basic desirable requirements of VLSI implementations, is allowed, to prevent severe 
throughput degradation, a new channel used to pipe out the partial sum of matrix R 
has to be built. While a new content of each cell is available, this content has to be 
added to the partial checksum sent from the previous cells through the new checksum 
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channel and then the partial checksum has to be passed to the right for the next cell. 
With these requirements and operations, not only the complexity is increased but 
also the throughput is hindered. Even when global communications of the processors 
are allowed, the checksum scheme for fault detection still reduces the throughput of 
the system as pointed out above. 

Most recently, [4] has proposed a new algorithm-based fault-tolerant technique 
applicable to the recursive LS triangular systolic array. With the addition of one 
extra column of processor array, errors in the tri-array can be detected by observing 
the scalar output of this column array. While there are some similarities between the 
results reported in [4] and that considered in this chapter (as well as that in an earlier 
version of our paper [65]), there are much differences in assumptions, techniques, and 
results between these two approaches. In any case, these two works were performed 
independently of each other. 

3.2 Fault Model 

As the VLSI technology progresses, the geometric features become smaller. Any 
defect affecting a given part of the circuitry may cause an entire module or a logic 
block to become faulty and to produce arbitrary errors. Thus, the traditional gate- 
level single stuck-at fault model is no longer appropriate for VLSI/WSI system. A 
cell or module is allowed to produce arbitrary errors if any part of the cell is under 
failures [17]. However, we assume that at most one cell can be faulty at a given short 
period of time. This is based on the assumption that the system reliability is such 
that the mean time between failures is long enough and the probability of more than 
one fault occurring is very small. Some basic assumptions we need are as follows: 
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1. If any part of the cell become faulty, the whole cell will not function correctly; 


2. The probability of the communication links and registers failing is very small 
and thus negligible [55]. 

The second assumption is reasonable since these components are typically much sim- 
pler and smaller than the processing cells themselves [55]. In addition, they can be 
implemented conservatively with high redundancy or with self-testing circuitry to 
mask a possible fault. 

3.3 Real-Time Fault- Tolerance 

While a demanding system is usually expected to operate under a high performance 
condition, when a fault has occurred, a slightly degraded performance is often accept- 
able under the circumstance. For a recursive QRD LS systolic array operating under 
a normal fault-free condition, the optimum LS solution is attained. A reduced order 
degraded LS solution can still yield a reasonable performance if the row and column 
with the faulty processor can be eliminated from the computation. This concept leads 
to our proposed design of a gracefully degraded fault-tolerance approach for the QRD 
LS systolic array. 

3.3.1 Concurrent Error Detection - Residual Method 

An inherent nature of the QRD LS systolic algorithm is that for a given data matrix 
A, the minimization of j| Au>,(n) — J/-(«)1|a for many desired response vectors y ,i 6 7, 
can be performed concurrently by appending some more RA’s to the systolic array 
such as shown in Fig. 3.12. The output of the i th RA, e,(n), is the minimal residual 
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Figure 3.12: Recursive LS systolic array with multiple desired responses 

of u T (n)tc,(n) — d,(n), where di(n) is the n tk desired response of i th input vector 
Ko belon § to c °l umn space of A. That is, € span{a(i),l < t < p}, 
then c ; a(j). The optimal LS residual and the associated weight vector 

for are e 0 (n) = 0 and u>,(n) = [ci, c 2 , • • • , Cp] r , for n > p. The actual selection of 
{cj, c 2 , • • • , Cp) will be given later. Various extentions of these fundamental properties 
of the optimum residual of a LS estimation problem form the basis of the proposed 
residual method approach toward concurrent error detection. Denote to be the 
artificial desired response (ADR) and the associated RA as the error detection 
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array (EDA). 


Lemma 3.1 Given the ADR = a(p), the contents of the EDA are identical to 
the contents of the p th column array of the QRD triangular array, and the optimal 
residual e 0 , the output of the p th cell of the EDA, is always zero. 

Proof: Both arrays are p x 1 vector. The first p — 1 elements of both arrays are 
identical given by the fact that the same data are rotated by the same c, and 
generated by boundary cells PEa, 1 < i < p — 1, where PEa denoted the processor 
at position (i,i). Thus the outputs of the (p — l) t/l cells of both arrays are identical. 
Initially, the contents of the p th cells of both array are zeros. Let the first non-zero 
output of (p— l) (/> cells be x, by the update equations of both cells, the first non- zero 
contents of both p th cells equal x. If the second non-zero output of (p — \) th cells 
is z, the updated content of the p th boundary cell equals \/X 2 x 2 + z 2 and that of 
the p th cell of the EDA is s v z + CpA = \/X 2 x 2 + z 2 , where s p = z/\/X 2 x 2 + z 2 and 
c p = Ax/\/A 2 x 2 4- z 2 . Therefore, the contents of both array are identical. Since the 
rotation coefficients, c p and s p , generated by PE pp boundary cell are proportional to 
x and c respectively, the output of the p th cell of the EDA, e 0 , is CpZ — s p x which is 
always zero.D 

If there is a fault in either the p th column of the array or the EDA, these contents 
are no longer identical and then lead to a non-zero e 0 . Thus a fault is detected. 
However, if there is any fault outside of these two arrays, then the errors produced 
by that fault will affect both of these arrays in the same manner (i.e., contents of 
both arrays are still identical) and resulting in a zero e 0 . Thus, these faults will not 
be detected by the y ^ = a(p) design. Clearly, we can generalize the above results by 
the following Lemma. 
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Lemma 3.2 Given the ADRy^ = a(k), for 1 < k < p, the contents of the k th column 
array of the QRD triarray and the first k cells of the EDA are identical. The output 
of the k th cell of the EDA is zero. The contents of cell l, k + 1 < l < p, of the error 
detection array are all zeros. 

Corollary 3.1 A fault occurring outside of the k th column of the QRD triarray and 
the EDA will not be detected if the ADR is designed as = a(k). 

Proof: From the previous discussion, a fault occurring in the i th , 1 < i < k — 1, 
column of QR array will not be detected. From Lemma 3.2, the output of the k th 
cell and the contents of cell /, k + 1 < / < p, of the EDA are all zeros. Thus, any 
fault occurring to the i th , k + 1 < i < p, column of QR array will be masked by 
these zeros. The optimal residual e 0 is always zero unless there is any inconsistency 
between the k ih column of the QR array and the EDA.D 

From all the above observations, by selecting y^ properly as given in the following 
theorem, we can detect the presence of a fault in any location of the system. 

Theorem 3.1 (Concurrent Error Detection Theorem) Consider the selection 
of the artificial desired response y^ = ]F?=i a(i). If there is no fault in the system, 
then the output of the EDA with y^ as an input yields e 0 = 0. If there is a fault in 
the system, then e 0 ^ 0. 

Proof: From Lemma 3.2, each a(i) is ’’zeroed out” by the i th cell of the EDA. Any 
error produced by a faulty processor, say in the j th column of the QR array, will not 
be zeroed out by the j th cells of the EDA. The output of the j th cell is then non-zero 
and propagates down to the output. Therefore, whenever ej ^ 0, there is a fault in 
the system. □ 
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The ADR y^ = £f =1 a(i) is obtained by implementing a top row encoding array 
(EA) consisting of p summing cells as shown in Fig. 3.13. The response array (RA), 
with the desired response y as input and e as output, located at the right of the EDA 
is incorporated with the system to produce the desired residual. Once eo ^ 0, which 
indicates the system had a fault, then e(n) is considered to be in error and will not 
be used. The error detection is thus achieved in real-time. 

Example 3.1: An adaptive filter using QRD LS systolic array with order p = 3 
is simulated. In between t = 25 and t = 35, a fault occurred in cell PE 23 in such a 
way that random noise within range [—1, 1] is generated. From Table 3.2, we can see, 
due to the propagation delay which will be considered in Section 3.4.2, from t = 28 to 
t = 38, e 0 7^ 0 results from errors generated by the faulty cell. The optimal residuals 
in e from t = 28 to t = 38 are then considered faulty. After t = 39, e 0 then decays 
down due to the adaptive nature of the algorithm. Fig. 3.15 plots |e 0 | versus t and 
shows the adaptive effect of the algorithm. A threshold device can be used with e 0 to 
provide a decision on the size of the error that can be tolerated. Fig. 3.16 shows the 
|e 0 1 in Fig. 3.15 with threshold set at 0.3. A generalization of the proposed scheme is 
stated below. 

Theorem 3.2 (Generalized Error Detection Theorem) Any fault occurring in 
the system can be detected if the ADR is given by y jQ = Y7i=\ c »a(0< where c, ^ 0, 1 < 
i < p.D 

The simplest ADR that can detect fault is indeed a checksum encoded data (given 
by Theorem 3.1) which is a special case of the set of ADR given by Theorem 3.2. 
However, unlike the checksum fault detection scheme in [17, 47], Theorem 3.1 and 
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Figure 3.13: Fault-tolerant recursive LS systolic array based on linear combination 
of input data in top row array feeding into column error detection array with output 
fault indication variable eo- 


( 
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(1) Encoding Cell 


Xin 



Xout 


(2) Internal Cell 


Xin 



X out 


(3) Final Cell 


Xin 



Xout 

Figure 3.14: Processing cells of fault-tolerant systolic array. 
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t- 1 

•0-0.000000 

•-0. 000000 

f 2 

•0-0.000000 

•-0. 000000 

t-3 

•0-0.000000 

•-0. 000000 

t- « 

•0-0.000000 

••0.000000 

f 5 

•0-0.000000 

•-0. 000000 

ft 

•0-0.000000 

•-0. 000000 

t- 7 

• 0—0.000000 

•-0. 000000 

t«8 

•0--0. 000000 

•—0.000000 

t-9 

• 0—0.000000 

•—0.000000 

t- 10 

• 0 — 0.000000 

•—0.000000 

t-n 

•0-0.000000 

•—0.000000 

t- 12 

•0-0.000000 

•-0. 000684 

f 13 

•0-0.000000 

•—0.153411 

f 14 

•0-0.000000 

• — 0.138241 

f IS 

•0-0.000000 

•-0. 155885 

fit 

•0—0.000000 

•-0. 065978 

f 17 

• 0 — 0.000000 

•-0. 507390 

t-ie 

•0—0.000000 

•-0. 000090 

f 15 

•0-0.000000 

•—0.704896 

f 20 

•0—0.000000 

•-0. 018910 

t- 21 

•0-0.000000 

•-0. 053229 

f 22 

•0-0.000000 

•—0.787624 

t- 23 

• 0—0.000000 

•-0. 015445 

t- 24 

•0-0.000000 

•—0.287406 

f 25 

•0-0.000000 

•—0.677166 

t- 26 

•0-0.000000 

•—0.050505 

f 27 

•0-0.000000 

•-0. 817616 

f 28 

• 0 — 0.966001 

•—0.553705 

t-29 

• 0—0.611989 

•-0. 325937 

t- 30 

•0-1 .083666 

•-0. 102332 

1-31 

• 0—0.506277 

•—0.100179 

t- 32 

• 0 — 0.391151 

•—0.430207 

t-33 

•0-0.538918 

•-0. 689049 

t-34 

• 0—0. 4266*79 

« — 0.220B55 

f 35 

• 0 — 0.558384 

•-0. 312747 

f 36 

•0-0.975315 

• — 0.117668 

f 37 

• 0—0.748951 

•-0.038787 

f 38 

•0-0.144688 

•-0. 219978 

f 39 

• 0—0.053715 

•-0. 386494 

f 40 

• 0—0.014142 

•-0. 095640 


Table 3.2: The output of e 0 and e of an adaptive QRD LS filter 
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Time 


Figure 3.15: Plot of |e 0 | of an adaptive QRD LS filter with order p = 3 when a fault 
occurred in PE 23 from t = 25 to t = 35. 



Time 


Figure 3.16: Plot of |e 0 | in Fig. 3(a) with threshold set at 0.3. 
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3.2 provide a real-time fault detection scheme using the inherent nature of QRD RLS 
systolic array. 


Optimal Efficiency 


Since a column of linear EDA and a row of linear EA are required, the complexity 
of this fault detection scheme is 2p. That is, 2 p redundant processors is required. 
Compare to the complexity of the triangular QR LS array, (p 2 + 3p)/2, it is a cost- 
effective real-time fault detection scheme. Here we do not count the final output 
multiplier cells in the EDA and RA. 

We define the hardware efficiency 0/, to be the ratio of the hardware cost of 
implementing the algorithm to the cost of implementing the algorithm with an error- 
detection capability. We see 


flfc(p) = 


p 2 + 3 p 


(3.59) 


p 2 + 7p 

Thus 1/2 < D/,(p) < 1 since a single error can be detected by duplicating the hard- 
ware. When flk(p) = 1, we say the error-detection scheme is most hardware efficient. 
Define the time efficiency Q. t to be the ratio of the time to implement the algorithm 
and the time to implement the algorithm incorporating the error- detection scheme. 
Obviously, time efficiency is bounded by 0 < 0 t < 1. When Cl t = 1, we say the error- 
detection scheme is most time efficient. If an error-detection scheme is both most 
hardware and time efficiency, then it is said to be optimal. For the proposed residual 
method, clearly lim p _ 00 fl/ l (p) = 1 as shown in Fig. 3.17. That is, it is asymptotically 
most hardware efficient. However, the time efficiency Q ( (p) = l,Vp, so that it is 
also most time efficient. Therefore, the residual method is an asymptotically optimal 
error-detection scheme for the recursive LS systolic array. 
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Hardware Efficiency 



Order of Least-Squares (p) 

Figure 3.17: Plot of the hardware efficiency of the residual method versus the order 
of the LS estimation. 

3.3.2 Fault Diagnosis 

When a fault is detected, the system leaves the concurrent error detection phase and 
enters the fault diagnosis phase. The main purpose of this phase is to find the faulty 
processor row. Either of two methods, the flushing fault location (FFL) method or 
the checksum encoding (CSE) method, can be used to diagnose and locate the faulty 
row. The FFL method is developed under the assumption that only the residual 
output eo can be accessed externally, while the CSE method assumes that all the 
cells of the EDA can be accessed. 
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Flushing Fault Location Method 


During the concurrent error detection phase, a fault is detected based on unknown 
values of the incoming data in a(k), l < k < p, and contents of all the cells. However, 
in the FFL method, we will control the desired incoming data as well as the contents 
of the tri-array and the EDA, in order to obtain an appropriate value in e 0 to locate 
the faulty processor row. In the FFL method, we do not use the operations of the 
RA cells or the EA cells. From (2.3), the weight vector w = (uq, u> 2 , • • • , w p ) can be 
solved by using the back substitution method of 

Pi - Lj=i + 1 r.jWj 


w, = 


-, i = p, p- 


(3.60) 


where P x and r tJ are elements of vector P_ and matrix R. A linear array to performed 
the back substitution as in [29] can be prevented by using the weight flushing technique 
[106], 


In a fault-free triangular LS systolic array, by ’’freezing” the QRD upper triangular 
matrix R(n) and the associated column vector F(n) in (2.3), the optimum solution 
w(n ) can be ’’flushed” out sequentially with a skewed identity matrix input. In these 
operations, the internal cells (defined in Fig. 3.17) in a given row, act as pure bypass 
elements with no Givens rotations, when the input to the boundary cell of that row 
is zero and c = 1 and s = 0 are being propagated. For example, let p = 3 and 
£ = (£, £, £}), then 


^3 — -£/ r 33i 

^2 — {Pi — r 23^3)/ r 22i 

u>j = (Pj - r l2 w 2 - r l3 w 3 )/r n . (3.61) 


However, due to errors generated by the faulty processor, various parts of R(n) 
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and P_ stored in the array are no longer correct. A new test triangular matrix T 
and a test vector P, are load to the tri-array and EDA respectively. The values of 
T and can be either pre-stored or distributed by an host computer when a fault 
is detected. Specifically, define T = [tj, < 2 , • • • ,t p ] as a p x p all-l’s upper triangular 
matrix, where t, is a p x 1 vector, and P_ t as a p x 1 all-1 ’s vector. Since JP t = t p , 
the optimal solution vector w# = [0,0, - • • , 1], as given by (2.3). One of the reasons 
for the selection of these all l’s in T and P_ t is to reduce memory requirement. Only 
one-bit register is required for each cell or distribution requirement. 

With T and P_ t frozen, consider a skewed pxp identity matrix input to the first 
p columns and all zeros input to the EDA. In the absence of a fault, components of 
the optimum solution vector, w n = [0. 0. • ■ • , 1], are outputted sequentially at e 0 [106]. 
Denote <p T = [0, • • • , 0, 1, 0, • • • , 0] as a 1 x p vector of all zeros except for an one at the 
i th position, representing the i th row of the skewed identity matrix input to the array. 
Then the output e 0 (l) in response to 4? is processed by all the cells from the first to 
the p th row of the array. In general, e 0 (f) is the response to 4 'J {i- e - e o(i) = 
and is processed by all the cells from the i th to the p th row. As considered above, in 
the absence of a fault, e 0 (t) = 0, 1 < i < p — 1, and e 0 (p) = 1. However, with a fault 
in the k th row, then e 0 (i) ^ 0, 1 < i < k , but responses due to 4? for k + 1 < j < p, 
encountering only fault free processing cells from the j th to the p th row, will yield the 
correct value. This property can be used to locate the faulty row in the FFL method. 

Theorem 3.3 (Flushing Fault Location Theorem) When a fault is detected and 
the system, enters the fault diagnosis phase, both the EA and RA arrays are made in- 
operative and all 1 ’s are loaded into the contents of the processor cell. A skewed 
identity pxp matrix is flushed into the system with all zeros input to the EDA. The 
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EDA output e 0 (i),l < i < p is obtained sequentially. Assume the first zero output at 
eo (k + 1) = 0, occurs for some k , 1 < k < p — 2, then the faulty processor is in the 
k ttl row. If there is no such k for eo {k + 1) = 0, 1 < k < p — 2, and eo (p) = 1, then 
the (p — l) th row is the faulty row; otherwise, with e 0 (p) ^ 1, the p th row is the faulty 
row.O 

Example 3.2: A QRD RLS array with order p = 5 is considered. Suppose a 
fault has occurred in PE 34 . When a skewed 5x5 identity matrix is flushed into the 
system, due to the randomly generated noise from the faulty cell PE 34 , the outputs 
from e 0 are given by [0.2127, —0.5714, 0.7453, 0., 1.]. In the absence of an error, the 
outputs should be [0., 0., 0., 0., 1.]. Since the first three elements are erroneous, based 
on Theorem 3.3, the faulty cell is in the third row. 

It is obvious the flushing of 0 is unnecessary since computations involving the 
entire QR array are definitely incorrect in the fault diagnosis phase. 

Checksum Encoding Method 

The basic assumption of the CSE method is that all the cells of the EDA can be 
accessed. Further more, all the contents of the tri-array can be piped out in the 
diagnosis phase. In this chapter, instead of using the CSE as a fault detector as 
in [17], it is used to diagnose fault location when a fault has been detected. The 
disadvantages of the checksum scheme for real-time application is thus prevented. It 
has been shown in [17, 47] that the QRD of a row checksum matrix A r results in 
a row checksum upper triangular matrix R T . Let r tJ (n) be the content of processor 
PEij of the tri-array at time n and P,(n) be the content of the i th processor of the 
EDA at time n. 
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Theorem 3.4 (Checksum Encoding Theorem) Given the artificial desired re- 
sponse i/o = ULi Qf«o(0) f or Q i 4 1 0) the checksum 

X] Q i r i(i+k){n + k) = Pi(n + p - i + 1), (3.62) 

k = 0 

holds for i = 1, 2, • • • ,p, n — p,p + l,p + 2, • • •, if no fault has occurred. If there is 
an m such that the checksum does not hold for m < i < p, then there is a fault in the 
system and the faulty processor is in the first row that does not meet the checksum .□ 

The time indices are introduced to describe the time difference for a given row 
input of data to the array. Each column of inputs, say a{k), is zeroed out by the k th cell 
of EDA. Thus the content of the k tfl cell of EDA is affected only by a(i), k < i < p. 
Therefore, if there is a faulty processor, say in row m, then all the rows below do not 
satisfy the checksum because of the error produced by the faulty one which cannot 
be zeroed out by the m ih cell of EDA. 

Example 3.3: Consider a [f? : £o] matrix of intermediate results for a QRD RLS 
array with order p = 4, 


0.7930 

0.7462 

0.4655 

0.9774 1 

: 2.9S21 

0. 

1.2973 

1.0816 

1.0379 : 

! 3.4168 

0. 

0. 

0.2729 

0.3643 ! 

! 1.7132 

0. 

0. 

0. 

0.1675 : 

: 0.8375 


where the (i,i + k) element, r, of R , takes the value at time n + k due to time 
skewing of the input. That is, r ti (j + fc)(n + k). The i th element of JP takes the value at 
time n + 5 — i. That is, P,(n + 5 — i). As we can see, starting at the third row, the 
checksums are no longer consistent for each row. From Theorem 3.4, the faulty cell 
is in the third row. 
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Unlike the FFL method, the CSE method cannot stop the concurrent error detec- 
tion phase immediately when a fault is detected. Because of the skewed manner of 
inputting the data, if we stop the operation immediately, the checksum property will 
not hold according to Theorem 3.4. Each processor PEij, 1 < i < j < p, of the QR 
tri-array has to take j — i more data and the i tfl cell of the EDA has to take p + 1 — i 
more data so that the checksum is satisfied for each row of the systolic array. If each 
data requires one system clock, we observe that at most p more system clocks are 
needed to process those unfinished data after the moment a fault is detected. The 
last row takes one clock and the first row takes p clocks. Generally, the i th row takes 
p + 1 — i clocks to process the unfinished data. Thus, those rows which take fewer 
clocks can pipe their final results out to the right to check their checksum while others 
rows are still working on their unfinished data. 

3.3.3 Order-Degraded Reconfiguration 

An order-degraded performance is reasonable and often acceptable in many LS appli- 
cations. A reconfiguration is needed to reroute data paths for order- degraded opera- 
tion. Many models and approaches can be found in the literatures [58, 44, 57, 92] for 
the reconfiguration of VLSI array processors. Here we use a similar model described 
in [58]. When the faulty row, say row k, is determined, the cells in the k th column and 
row become connection elements and enter a dormant state. In the dormant state, 
each cell tests itself to check its status repeatedly [58]. The reduced (p — 1) x (p — 1) 
tri-array then operates in an order-degraded LS computational manner. Fig. 3.18 
shows an example of bypassing the faulty row and the associated column to become 
an order-degraded LS array. When the (transient) fault is removed, the dormant cells 
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Figure 3.18: Reduced (p — 1) x (p — 1) tri-array after the deletion of the row and 
column with a faulty cell. 

reactivate and generate an interrupt immediately. The reactivation scheme recovers 
all of the cells which become connection elements before and turns them into active 
cells. Then the full-order LS operation is resumed. Detail reviews on various schemes 
and technologies of reconfiguration can be found in [13]. 


3.4 Error Propagation and Recovery Latency 


In this section, we derive and compare the performances of the proposed fault-tolerant 
schemes. 
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3.4.1 Robust Error Detection 


Some definitions needed in the analysis are first given. A processor is said to be 
faulty if a fault has occurred in it and is said to have been contaminated if it contains 
erroneous data or states. The faulty moment is the time instant a fault first occurs 
within the processor, and the contaminating moment is the time instant the first error 
occurs in the processor [98]. A fault is said to be observable if some of the observable 
states , such as signal data and system output, of the processor can be affected by 
the fault. It may take some latent period to change the observable states incorrectly 
once the fault occurs. An immediately observable fault is a fault which affects some 
of the observable states right after its occurrence. To make the analysis tractable, 
we first assume all faults are immediately observable faults such that when the fault 
occurs, it produces error at the output of the faulty processor immediately. Later, 
we will relax this assumption for more general results. The error is then propagated 
and contaminates those processors affected. The moment the error is first observed 
at the output of the system is called the error observed moment. First, assume a 
fault occurs in an internal cell PE t] ,i ^ j, at a faulty moment. The output of this 
faulty cell is thus erroneous and can be described by x c out = x oui ■+■ 6 , where x out is 
the fault-free output and <5 is the error generated by the fault. The error propagation 
path can be described by 


PE, } -» PE {i + i)j -*• * PE 1} , 

and then PE^i, k > j, l > j are all contaminated. 

From the operations executed by the internal cell, the error is modified to c t+i 6 by 
PE(i+\)j and the cumulative modifications of the error before reaching the boundary 
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cell, PEjj, is 

j-i 

f] = 6 n c ki (3-63) 

k=i+l 

where c, is the cosine parameter generated by the boundary cell PE^. Therefore, Cj 
and Sj are erroneous and are given by 


A r 

\J A 2 r 2 + (X t - n + T}) 2 

(g»n + rj) 

yJX 1 ^ + (X, n + Tj) 2 


(3.64) 


In this case, s : is no longer proportional to x, n , a(j) will not be zeroed out by the j th 


cell of the EDA. The size of the error generated by this cell is 


Hi = c : x in - SjXr = 


A rr) 

v/r' 2 + 2qx~ + T] 2 ’ 


(3.65) 


where r' = X 2 r 2 + x 2 n is the new updated uncontaminated value of the content of 
PE] j. Although those c k s and s k 's for j < k < p are contaminated, a(k), j < k < p, 
are zeroed out by the k th cell of the EDA because of the consistency of the sine 
and cosine parameters used by the k th column array of the QR tri-array and the 


EDA. When rjj propagates down to the output of the EDA, t]j is influenced by the 
contaminated cosines c' of each following row. The error output at e 0 due to an error 
8 generated at PE i: is then given by 


4(*J) = -7 n C 'r 


r/Xr 


= < 


m=j+l Vr' 2 + 27?I, n + Tj 1 

( ~ 7>r nu , ck rims ,+ 1 c ’™ s { < j _ 2 

^/r'2+2x, n n*+l c ^ + ni=!+i c i p 

l y/r'^ + 'iXtnS+S^ 


(3.66) 


i = j — 1 . 

where 7 = nCi c i n£ =J c ' k . Next, assume a fault occurs in a boundary cell, PE j: , 1 < 
j < Pi at the faulty moment. Both erroneous c'j and s' produced by PEj } can be 
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written by 


/ Ar -t - 8 C t x, n -f" 8 3 

c — s = 

j r i ’ i r > 

' £ £ 

where 8 C and 6, represent errors in the numerators while r' represents the erroneous 
content of the denominators of Cj and Sj. The error produced by the j th cell of the 
EDA is then given by 



rjj = CjX ln - s' Ar = 


( x in 8 c - Xr8 s ) 


(3.68) 


and the output error at e 0 due to a faulty boundary cell is given by 

4uj)=y n 4. ■ ^ ( 3 . 69 ) 

m=j + l r c 

A fault can occur either in the internal or boundary cell. First assume the fault 
occurs in an internal cell. The cosine parameter, Cj, produced by PEjj is bounded 

by 0 < Cj < 1. 


Lemma 3.3 For some c Jt if there exists ann € / such that c : {n) ^ 0, then c_,(m) > 0 
for all m > n, m 6 / . 


Proof: The cosine parameter is given by Ck+i = Ar(A.-)/r'(fc), where r(k + 1) = 
r'(k) = + 3 n — + Cj(n) ^ 0 is equivalent to say that 3n — > r(n) ^ 0 

orr(n) > 0. Since r(A’-fl) = r'(k) > Xr(k), we have r(k) > 0 for all k > n. Therefore, 
Cj(k) > 0 for all k > n.D 

For linearly independent input column vectors, all Cj(n) 0, and from (3.66), we 
see e£ ^ 0 if an error has been introduced. Thus the fault can always be detected in 
this case. When the fault occurs in a boundary cell, from (3.69) we conclude that the 
only chance of a loss of detection is that e£ = 0. This happens only if 8 c /8, = Xr/ii n , 
which is very unlikely since the value of 8 C , 8 a , r, and x, n are not related in any 
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manner. Furthermore, even if 6 c /6 3 equals Ar(n)/x,„(n) for some instant of time at 
n, it is even more unlikely that S c /S a equal Ar(n -f l)/x,„(n + 1) at time n + 1. We 
can summarize these results in the following theorem. Note that the statement of 
Theorem 3.5 is valid only for infinite-precision implementation. 

Theorem 3.5 (Robust Error Detection Theorem) An error produced by a faulty 
processor at the faulty moment will be detected at the EDA output e 0 and the proba- 
bility of error detection given a fault occurs equal one. That is, 

Pr(error detected at eo \ a fault occurred) = !.□ (3.70) 


3.4.2 Latencies 


Now, we consider some basic issues related to latencies in the array. 

Definition 5.1: The system latency , t s , is the time between the moment of data 
input to the system and the moment of the output of this data from the system. 
Definition 5.2: The processing latency , t p , of processor PE, } is the time between 
the moment a data in a wavefront inputs to the system and the moment PE tJ is 
processing data from that wavefront. 

Definition 5.3: The error propagation latency , t e , is the time between the faulty 
moment and the error observed moment. It is clear that the system latency of the 
QR recursive LS array depends on the number of processors and delay elements on 
the boundary and is given by t, = 2p + 1. The processing latency of processor 
PE tJ , 1 < i < j < p + 1, is given by t p = (i + j) — 1. Since there are a totally of 
p(p + 3)/2 processors, the expected processing latency is 


ru \ ^ A v~v • , • i x _ P 2 -f 4p -f 1 

M ~P(P + 3)S£ ( P + 3 ' 
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Figure 3.19: Plot of expected processing latency and expected error propagation 
latency versus the order of the LS estimation. 

The error propagation latency is given by 


t c = t 3 -i p = 2(p + 1) - (i + j), 


(3.72) 


and the expected value is 


E(t c ) = 


(P+ 1)(P + 2) 


(3.73) 


( P + 3) 

Fig. 3.19 shows the plots of E(t p ) and E(i t ). 

Definition 5.4: The fault diagnosis time , tf, of a faulty processor PEi : is the mini- 
mum time required to locate the faulty row right after the error observed moment. 
Definition 5.5: The recovery latency , t T , be the time between the faulty moment 
and the moment the faulty row is determined. Since the system latency is 2p + 1, 
for the FFL method the fault diagnosis time of processor PE X] for the array is 
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t FFL = (2 p + 1) + i. We can show that the fault diagnosis time for the CSE method is 
t ESE = p-j-2 — i. The expected value for fault diagnosis time are E{t FFL ) = (5/2)p + l 
and E{t C j SE ) = (l/2)p + 2 respectively. By the definition of the recovery latency, 
we have t r = t c + tj. Therefore, the recovery latency are t FFL = 4p — j + 3 and 
fCSE _ 3p _ 2 j _ j .j. 4, w hile the expected recovery latency are 


E{t FFL ) 

E{t ESE ) 


7p 2 + 23p+10 

2p + 6 

3p 2 + 13p + 16 

2p + 6 


(3.74) 


respectively. Due to the facts that multiple ports can be accessed externally and we 
can use the parallel pipe-out feature of the CSE method, it is not surprised that the 
performance of the CSE method is better than that of the FFL method as indicated 
in Fig. 3.20. However, for both cases, the order of the expected recovery latency is 
0(p), which is linear with respect to p. In practice, a (transcient) fault may not 
be necessarily an immediately observable fault. Without this assumption, all the 
values obtained in this section become the lower bounds of those parameters. That 
is, performance obtained by the assumption of immediately observable fault is the 
best performance we can achieve. 


3.5 Conservation Test 

Thus far the fault-tolerance system discussed above is designed to detect the fault 
occurring in the QR triarray and the EDA. It is not applicable to the RA which 
computes the real desired response. In this section, we introduce some conservation 
properties to tackle this problem and the data is still assumed to be real. 

A: Cell Level Test 
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Expected Recovery Latency of FFL & CSE 



Figure 3.20: Comparisons of expected recovery latency for FFL and CSE methods. 

Al: Energy Conservation The rotation operation of internal cell is described in 
Fig.l. Denote the updated r as r' . Then it can be seen that 

+ r' 2 = *1 + ^r 2 , (3-75) 


which means the energy, (i.e. the 2-norm) is conserved before and after the 
operation. This is due to the fact that the Givens rotation transformation is an 
unitary transformation which conserves the energy. 

A2: Inverse of Unitary Matrix The above energy conservation test requires the 
square operation which is different from the operation of the internal cell. In a 
VLSI/WSI system, it is desirable to keep the number of different kinds of pro- 
cessors or cells as low as possible [74, 54]. Observe that the inverse of a unitary 
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matrix is the Hermitian transpose of it. If there is no fault occurring during the 
computations, the inverse computation should give the original values. That is 


• 


■ 

■ 


* 

3* in 


c 

s 


3-out 

A r 


—s 

c 


r' 


(3.76) 


One of the above inverse computation is sufficient to detect a fault. For example, 
the inverse computation of x< n = cx out + sr 1 will not match the original x, n , if 
x out or r' is erroneous. This kind of test can be carried out by the internal cell. 


B: System Level Test 

Consider the RA which computes the real desired response. Based on the previous 
discussion, the energy is conserved for unitary transformation. Denote P t (n ), 1 < 

i < p, to be the content of the i th cell of the array at time n and d as defined in 
section 2. The system level energy conservation is described by 

£/?(n + t-l)=£y(;). (3.77) 

>=1 j= 1 

That is, the energy of the input signal is equal to the energy in the RA. Note that the 
introduced time index is used to describe the operations of continuously updating the 
incoming signal. Both cell and system level tests can be built either in the system or 
applied externally depending on the requirements of the application. 
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Chapter 4 


Dynamic Range, Stability, and 
Fault-tolerant Capability of 
Finite-precision QRD RLS 
Systolic Algorithm 


In this chapter, we first observe that the cosine parameters generated by boundary 
cells will eventually reach quasi steady-state if A is close to one which is the usual 
case. We will show that the quasi steady-state and ensemble values of sine and cosine 
parameters are the same for all boundary cells. It is independent of the statistics of 
the input data sequence and the position of the boundary cell which generates the sine 
and cosine parameters. Simulation results are presented to support this observation. 
These results yield the tools needed to further investigate many properties of the 
QRD LS systolic algorithm. Then, we can obtain upper bounds of the dynamic 
range of processing cells. Thus, lower bounds on the memory size can be obtained 
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from upper bounds of the dynamic range to prevent overflow and to ensure correct 
operations of the QRD LS algorithm. With these results, we reconsider the stability 
problem under quantization effects with a more general analysis and obtain tighter 
bounds than given in previous work [60]. Two important factors of the fault- tolerant 
capability, the missing error detection and the false alarm effects are also studied in 
this chapter. 

Quasi steady-state of the rotation parameters is discussed in section 4.1. Dynamic 
range and lower bound on memory size are derived in section 4.2. Stability and 
quantization effects are studied in section 4.3. Finally, the fault-tolerant capability is 
presented in section 4.4. 


4.1 Quasi Steady- State and Ensemble Behavior 

From the updated recursive equation of the boundary cell (Fig. 2.2), we have 

k 

r 2 {k + 1) = A V(fc) + x 2 (k) = ]T A 2t x 2 {k - *). (4-78) 

«=o 

Assume {x} is a bounded random sequence of zero-mean and variance <x 2 , then the 
expected value of r 2 (k + 1 ) is 

k 1 _ \2(fc+l) 

E\r 2 (k + 1)] = £*"£(*’(* - 0) = t* ■ (4.79) 

i=0 1 A 

For |Aj < 1, then 

lim E[r 2 {k)\ = t-~ (4.80) 

k — oo 1 — A* 

Since is a concave function, from Jensen inequality 

y^E{r{k)) < Hm s/E[r 2 {k)} = (4.81) 
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and from (4.78) 


< lim r(k + 1) < 


(4.82) 


where |x moI | and |x m j n j are the maximum and minimum values of {|x|}, respectively. 

The cosine parameter of the Givens rotation is computed by c(fc+l) = A r(k)/r(k+ 
1). The steady-state of this parameter exists if limt_ co c(A:) exists. For the sequence 
(c(-)} to have a steady-state, we need lim^—oo r(k)/r(k+ 1) = a, where a is a constant. 
If q < 1, then the sequence (r(-)} is unbounded which conflicts with (4.82) that 
indicates {;*(•)} should be bounded; if a > 1, then lim*— oo r(k) = 0 which, again, 
conflicts with (4.82) (unless {x} is a zero sequence). Therefore, a has to be a unity 
to guarantee the steady-state of {c(-)} exists. That is, 

r(k) 


lim , , 

fc— oo r(k T 1 ) 

and the steady-state value of cosine, if exists, is 


1 , 


(4.83) 


lim c(k + 1) = lim - = A. 

fc— oo A:-oo r{k + 1) 


(4.84) 


From (4.78), we can see that if A = 1, then lim^,*, r(k) — > oo such that limjt -,00 r(k)/r(k+ 
1) = 1. In this case, though the steady-state of {c(-)} exists, {r(-)} is unbounded. 
Usually A is chosen between .99 and 1 which is very close to one 1 . When we update 
r(k) to r(k + 1) using (4.78), a A portion of r(k ) is forgotten and an input x(k) is 
added into it. If A is close to one, when k is very large, r(k) will come close to r(k+ 1) 
and the input x(k) plays less and less significant role in computing r(k + 1) as the 
case when A = 1. It is obvious that for A close to one, 


lim Er(k) zz lim Er(k - 1- 1). 

k — >oo k-+ oo 


'For different expressions as in [32, 60, 78], A is between .98 and 1 
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Therefore, from the averaging principle [76] which has been used successfully in many 
situations, the expected cosine can be approximated by 


lim Ec(k + 1) ~ A Tr 7 T ~^ it 
*-«> v ’ Er(k + 1) 


When A is close to one, from above discussions, we have 


A. 


(4.85) 


lim c(k) = lim = A + 8( A, x), 

k—oo k -> oo r{k -fi 1 ) 


(4.86) 


where 6(A,x) represents the small deviation due to the forgotten A portion of r and 
input of x. If 8 is very small such that it is negligible when k is large, we say that the 
sequence {c(-)j reaches the quasi steady-state. 

Generally, it is difficult to quantitatively describe 8( A, x). Here we model the input 
signal to the systolic array as a second-order AR process described by 


x(n) -f aix(n — 1) + a 2 x(n — 2) = u(n), (4-87) 

where v(n) is a white Gaussian noise process of zero mean and unit variance. By 
choosing of different AR parameters cq and a 2 , we obtain different realizations of 
the AR process [32]. In our simulations, three different categories of signal are en- 
countered. The first category consists of three stationary AR processes which are 
ARl (oq = —0.1, a 2 = —0.8), AR2 (cq = 0.1, a 2 = —0.8) with real roots and AR3 
(cq = —0.975, a 2 = 0.95) with complex-conjugate roots. The second catogory is a 
non-stationary AR process, AR4 (a! = —0.6, a 2 = —0.5), and the third catogory is a 
white Gaussian noise process, WN, with zero mean and unit variance. All of the AR 
processes are normalized to unit variance. Table 4.3 shows the mean distribution for 
different input data with different A values. This table justifies the result in (4.85). 
Table 4.4 shows the variance distribuation of different input data with different A 
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AR1 

AR2 

AR3 

AR4 

WN ' 

HS&ii] 

.9800 

.9800 

.9802 

.9799 

.9801 

X= 985 

.9849 

.9849 

.9851 

.9848 


A= 990 

.9897 

.9897 

.9900 

.9897 

.9899 i 

X=991 

.9907 

.9907 

.9910 

.9907 


A=993 

.9927 

.9927 

.9930 

.9927 

.9929 

X=995 

.9947 

.9947 

.9950 

.9947 

.9949 

A=. 997 

.9967 

.9967 

.9970 

.9967 

.9969 

^=■999 







Table 4.3: Mean distribution for different input data with different A values 

value. The values of those variances are in the order of 10 -4 to 10 6 which implies 
that 6 is indeed very small. They can be closely approximated by using quadratic 


polynomials as follows, 


ARl : 

<r 2 (A) = 1.5938 - 3.182A + 1.5882A 2 

AR2 : 

<7 2 (A) = 1.5991 - 3. 1919A + 1.5928A 2 

Al/?3 : 

ct 2 (A) = 1.5812 - 3.1595A -f 1 .5784 A 2 

ARA : 

<r 2 (A) = 1.4492 - 2.8936A + 1.4444 A 2 

AR5 : 

<r 2 (A) = 1.6437 - 3.2904A+ 1.6431A 2 (4.88) 

We can see, although the statistics of the input data are different, the variances 
can be described by A in a very similar way (see Fig. 4.21). This means, when A is 
close to one and the quasi steady-state is reached, the size of the variation 6 is mainly 


governed by A instead of the statistics of the input data. Fig. 4.21 shows the plots of 
the variances in dB scale. 
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AR1 

AR2 

AR3 

AR4 

WN 

A= 980 

7.3885e-4 

7.5465e-4 

6.8163e-4 

6.6721e-4 

7.3367e-4 

X=.985 

4.3970e-4 

4.5144e-4 

3.9577e-4 

3.9517e-4 

4.3308e-4 


2.0903e-4 

2.1463e-4 

1.8376e-4 

1.8918e-4 

2.0080e-4 

*.= 991 

1.7154e-4 

1.7875e-4 

1.4883e-4 

1.5562e-4 

1.6659e-4 

X=.993 

1.0991e-4 

1.1390e-4 

9.1016e-5 

9.6440e-5 

1.0323e-4 

*.=.995 

5.9724e-5 

6.0796e-5 

4.6789e-5 

5.1856e-5 

5.3525e-5 

*-=.997 

2.3007e-5 

2.4735e-5 

1.6808e-5 

1.9908e-5 

2.0504e-5 

\= 999 

EIIkhII 


KISRfl 

ElUIRfl 



Table 4.4: Variance distributations for different input data with different A values 

With these results, we conclude that sequence {c(-)} reaches the quasi steady-state 
regardless the input statistics if A is close to one. Thus, we can write 

lim c(fc + 1) ~ lim Ec(k + 1) ~ A, 

k—oa fc—oo ' 

Km s(k +1) ~ Hm Es{k + 1) ~ - A 2 . (4.89) 

The quasi steady-state and ensemble values of sine and cosine parameters are the same 
for all boundary cells. It is independent of the statistics of the input data sequence 
and the position of the boundary cell which generates the sine and cosine parameters. 
These results yield the tools needed to further investigate many properties of the 
QRD LS systolic algorithm. 
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Figure 4.21: Plots of the variances in dB 

4.2 Dynamic Range and Lower Bound on Mem- 


ory Size 


The dynamic range of the content of boundary cell P E\\ can be upper bounded by 
lim r 2 n {k - f 1) = lim £ X 2i x 2 (k - i) < lim x 2 max £ X 2 ' = • ( 4 - 90 ) 

fc-oo m ’ fc-oo )=0 1-A 

|x r 

FnWl ^ " 

> 1 

For internal cell PE\ V we have 


Therefore, 






(4.91) 


|r lj (i + l)| = \s(k)x(k) + c(k)Xr(k)\ 
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(4.92) 


= |s(fc)x(A:) + c(fc)A[s(fc — 1 )x(k — 1) + c(k — l)Ar(fc — 1)]| 

< ^2y\x(k-i)s(k-i)\ l ]\c(k-j) 

4=0 j—O 

< |Zmax|£A‘|s(fc-t)| ri C ( fc -i) 

i=0 j = 0 

From the basic relation between the geometric mean and the arithmetic mean, we 
know 


a\ + a? + • • • + a n 

( ) > a, . a 2 . . . a„. 


n 


(4.93) 


If n is large enough, then from the law of large number, we know 

a i + a.2 -fi • • • + a n 


lim 

n— ►oo 


n 


E(a). 


Therefore, 

E{a) n > JJa,-, 

4 = 1 

when n is large. We can further simplify the bound for k —> oo by using this inequality 
as follows, 


lim | rij(k + 1)| < |x maI | lim ^ A 's{k - i)E{c(k - i))‘ 

k—*oo k—>oc " 

t=0 


= 


^A 2 '. V / T^v = 


= ». 


(4.94) 


vl-A 2 

From (4.90) and (4.93), we can see the steady-state dynamic range of the first row 
is upper bounded by 3? for both boundary and internal cells. The dynamic range of 
the second row depends on the output of internal cells of the first row. Denote the 
output of the first row as x ou( , we have 


x 0 ut{k + 1) = c(k)x(k) — s(k)Xr(k). 

The first term of the right-hand side of (4.95) can be bounded by 


(4.95) 


lim |c(fc)x(/:)| < A 

k — >oo 


(4.96) 
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and from (4.94) the second term is bounded by 

lim \s(k)Xr(k)\ < y/T^V • A -bsL = A|x max |. (4.97) 

k~oo vl — 

There are two possible cases. 

Case 1: Highly fluctuated input 

The value of x(k) may vary differently from time to time such that s(k ) may 
have an opposite sign of x(k). For this case 

lim |x out (fc)| < 2A|x mai |. (4.98) 

K-+ OO 

Case 2: Smooth input 

For this case, the input data sequence does not change its value rapidly, therefore 
s(k) may always have the same sign as x(k). The bound is 

lim |x otlt (fc)| < A|x mox |. (4.99) 

*—►00 

From (4.91) and (4.94), it is obvious the steady-state dynamic range of the second 
row is bounded by 

lim |r 2> (*)| < = 2ASR, (4.100) 

k—oo ^/i — y 

for the highly fluctuated input and 

lim |r 2j (/;)| < A3?, (4.101) 

k—> oo 

for the smooth input. From above results, the steady-state dynamic range of the m ih 
row is bounded by 

lim |r mj (fc)| < (2A) m_1 • 3?, (4.102) 

* — OO 
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for the highly fluctuated input and 


lim \r mj (k)\ < (A ) m -^, (4.103) 

k— >oo 

for the smooth input. 

Fig. 4.22 shows a simulation of the contents of internal cells of the first row and 
the second row. Denote B m as the word-length of the m th row, to prevent overflow and 
to insure the correct operation of the QRD LS algorithm, we require 2 Bm > (2A) m-1 3J 
for fixed point operation, and therefore 

B m > f(m - 1)(1 + log 2 A) + log 2 . (4.104) 


The memory size required for each row to prevent overflow increases and decreases 
for the highly fluctuated and smooth inputs respectively. For the fluctuated input, 
when (2A) m_1 = 2, one more bit is needed for the memory of each successive row. 
The number of rows m for each increase is 


m 


= ri + r 


1 


- 1 , 


(4.105) 


+ log 2 A 

which is a monotonically decreasing function of A. If A < 0.5, then there is no such m 
exists. That is, the memory size of the array can be fixed at 3? without the overflow 
problem. For smooth input, when A m_1 = |, one bit can be discarded from the 
memory of the following rows. The number of rows m for each decrease is 


m 


= fl- 


1 

log 2 A 


1, 


(4.106) 


which is a monotonically increasing function of A. For A < 0.5, m = 2. That is, for 
every two rows we can decrease one bit for the memory. 



A Plot of me Comenu of It* him now (AJWJ 
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A Plot of the Content! of The Ftnt Row (AR3) 
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A Plot of the Comenu of The Fim Row (AR4) 
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4.3 Stability and Quantization Effect 


In this section, we consider the stability under quantization effects. Here, the stability 
is defined in the sense of bounded input/bounded output (BIBO) as in [60]. From 
(4.98) and (4.99), the output of the m th row is bounded by 

lim |x ou t m | < (2A) m-1 |x mox |, (4.107) 

K— ►OO 

for the highly fluctuated input and 

lim |x ou(m | < A m_1 |x max | (4.108) 

k—>oo 

for the smooth input. 

The order p of the LS problem is finite and fixed. The output of the last row of 
the QR triarray is bounded, in the worst case, by lim;t_ 00 |x ou<p | 5- (2A) P-1 |x max |. The 
residual is then asymptotically bounded by 

lim |e(fc)| = lim 7(fc)|x tm( (k)\ < (2A) P_1 |x max |, (4.109) 

k—> oo k — * oo 

where 7 (k) = n P =1 c,(k) [78], Thus, for A < 1, if the input data are bounded, that is, 

| x max I < OO, the output is always bounded. The QRD LS systolic array constitutes 
a BIBO stable system under unlimitted precision condition. In practice, the memory 
size of each processing cell is finite- length. Leung and Haykin [60] first considered 
the stability under this effect and showed the QRD LS algorithm is stable under 

finite-precision implementation. Here we reconsider this problem and obtain a more 

■} 

general analysis and a tighter bound. 

Let Q(-) be the quantization operator and x be the quantized value of x. Since 
the quantization error for the additions of quantized parameters is much smaller than 
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that of the multiplications of them, to make the analysis simplier, we may express 
the quantization error for additions as 

Q(X>)=X>+<* n (4.110) 

i=l i=l 

From (4.78), the square of the quantized content of the boundary cell is 

f 2 (k + 1) = Q(Q(\ 2 f 2 (k)) + Q(x 2 (k))) = - 0) + W (4-111) 

i=0 

Since the quantization operator Q is a bounded operator such that |Q(x)| < A'|x| for 
all x and some I\ [60], (4.111) can be bounded by 

|f 2 (A- + 1)| < A'o|A 2 *± 2 (0)| + /\ 1 |A 2(fc - 1 >i 2 (l)| + • • • + I< k \x 2 (k)\ + 6 k+1 

5- A mor ■ imax(l + A 2 + • • • + \ 2k ), (4.112) 


where x mQX is the maximum quantized value of sequence x. The asymptotic behavior 
can be obtained by taking the limit on both sides. Thus, 

Jim |r 2 (A)| < R’maz ■ x 2 max 1 . (4.113) 

oo 1 — A 2 

Therefore, the quantized content is 


Jim |r(fc)| = Jim Q{yjf 2 {k)) 


< A" 

— max 


7— = K'Jt. 

f Z max 

y/l - A 2 


With the same arguments as in section 4.1, we then have 

fc—oo r(k + 1) 

and the quantized steady-state value of cosine is 

Jim c(k + 1) = Jim ^7TTT\ = 

k~* oo k—+oo r(k + lj 


(4.114) 


(4.115) 


(4.116) 
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and the quantized steady-state value of sine is 


lim $(k + 1) = \/l — A 2 . 

k—KDO 

Analogous to section 4.1, we can further obtain limjt—oo Ec(k) = A and limt—oo E§(k) = 
y/l - A 2 . 

Now consider the quantized content of the internal cell, from (4.92) 


+ 1)1 = |Q(Q(J(*)i(*» + QWfcjAiW)! 

= Y A' |x( A: - i)s(k — i)| JJ c(ir - j )) + 

1=0 j=0 

< K” max\x m ax\ £ ~ 01 II *(* ~ i). (4-117) 

»=0 j =0 

where A'” max results from quantization error including 6*+i. From section 4.2, (4.116), 
and (4.117), the quantized steady-state dynamic range of the internal cell is bounded 

by 


lim |f,j(fc)| < A ” max ======= 

k ~°° y/l - A 2 


= K ” & 

J k 771CLX ■ 


(4.118) 


The output of the m th row is bounded, under the quantization effect, by 


lim M*)| < A'” mar (2A) m_ 3? 

K—+OC 


(4.119) 


for the highly fluctuated input and 


lim MA:)| < 

k—*oc 


(4.120) 


for smooth input. 

From these results, the quantized asymptotic value of the residual can be obtained 


as 


lim |g(Jt)| < A'” max (2A) p - 1 &. (4.121) 

k—>oc 

Thus, if A < 1 and the input data are bounded, the QRD LS systolic array constitute 
a BIBO stable system under the quantization effect. 
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4.4 Finite-length Effect of Fault-tolerant Capa- 


bility 

In this section, we discuss the finite-length effects of fault-tolerant capability. The 
first problem is that of missing error detection which results from the cumulative 
multiplications of the cosine value with a small error. Under a finite-precision imple- 
mentation, this may result in a failure of error detection. The minimum word-length 
to circumvent this problem is then derived. The second problem is called the false 
alarm. With the quantization effects, the system without fault may produce quanti- 
zation error to cause the false alarm. A threshold device is then introduced to tackle 
this problem. 

4.4.1 Missing Error Detection 

By missing error detection we mean that a small error generated by a faulty processing 
cell is not detected due to the finite-precision computation. Assume a fault occurs 
in an internal cell PE lv i j, at a faulty moment. The output of this faulty cell is 
thus erroneous and can be described by x l out = x out + <5, where x out is the fault-free 
output and 6 is the error generated by the fault. The error propagation path can be 
described by 

PE,j -+ PE {i+ 1 )> -» > PEjj, 

and then P Em, k > j, l > j are all contaminated. From the operations executed 
by the internal cell, the error is modified to Ci +i 6 by PE^+i^j and the cumulative 
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modifications of the error before reaching the boundary cell, PEjj, is 


j-i 

n = n c *♦ 

*r=t+l 


(4.122) 


where c, is the cosine parameter generated by the boundary cell PE,, . Let c' and s' 
denote the erroneous Cj and Sj respectively. The o' and s' are then given by 

A r , x in + rj 


c. = 


s' = 


(4.123) 


J \/A 2 r 2 + (Xin + T}) 2 J y^r 2 + (ar<n + V) 2 
In this case, s' is no longer proportional to x tn , a(j) will not be zeroed out by the j th 
cell of the EDA. The size of the error generated by this cell is 

Xrrj 


T)j = c'jXin - s' Ar = 


\J r ' 2 + 2r^i m + T] 2 


= -Cji), 


(4.124) 


where r' = J X 2 r 2 + xj n is the new updated uncontaminated value of the content of 
PE :: . When r/j propagates down to the output of the EDA, r}j is influenced by the 
contaminated cosines c' of each following row. The error output at eo due to an error 
6 generated at PE is then given by 


m=] 


4 ihj) = -7 II <4^ = -7 II c 'mV 

m=j + 1 

= ~i n ■ n c m< 5 > 

k=i+1 m=j 


(4.125) 


where 7 = ULi Q FlL [1], It becomes 


i-i 


4(ij) = - ii c ' n 4 n c m<5- 

1=1 k=i + 1 m=j 


(4.126) 


Next, assume a fault occurs in a boundary cell, PE : j, 1 < j < p, at the faulty 
moment. Both erroneous c' and s' produced by PEjj can be written by 


c i = 


Ar + 5 C 


1 %in “h 

S i = ' 


(4.127) 
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where 8 C and 6, represent errors in the numerators while r' represents the erroneous 
content of the denominators of cj a nd Sj. The error produced by the j th cell of the 
EDA is then given by 


i / . %in&c 

Vi = CjXin - SjXr = . 


(4.128) 


and the output error at e 0 due to a faulty boundary cell is given by 


4u,i) = 7 n 4 - — . Xr6, 

m= j + 1 r c 


= ri c '- n 

/=1 m=j + 1 


(4.129) 


From (4.126) and (4.129), we can see that ^ 0, under unlimited precision condition, 
if there is a fault occurring in the system, except when u in 8 c = A r6 a in (4.128). 
However, this is unlikely to happen. From Lemma 3.3, for 0 < < 1, the error 

may not be detected after multiple multiplications of c, in (4.126) and (4.129) under 
finite-precision implementation. It is obvious there is no such problem when 6 is 
large. Since r in (4.123) tends to be a large number asymptotically, it is reasonable 
to assume the error size 8 generated by a fault is much smaller than r when 8 is small. 
Under this circumstance, from (4.123), we have c' = c y In the quasi steady-state, 
the asymptotic behavior of erroneous cosine is c' == Cj = A. From (4.126) and (4.129), 
the error output e£ due to an error size 8 is then approximated by 

4(>\ j) 3 -\»-6 (4.130) 


for a faulty internal cell and 


4(i,i) = A 2 ’--’,, 


(4.131) 


for a faulty boundary cell. Denote B& be the word- length of each memory of fixed 
point arithmetics. That is, each memory size is of B& bits and let A = min(<5, 7^). 
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To insure the detection of error of size A, we need 


A 2p "‘A > A 2p > 2" B *. (4.132) 

Therefore, the memory size should be at least 

5a > f— 2 plog 2 A — log 2 A] (4.133) 

such that the small error size can be detected. The second term of the right-hand 
size is obvious since the error size A must be detected; the first term is to account 
for the effects that the error propagates through the array. 

4.4.2 False Alarm 

Due to the finite-precision implementation, the residual output of the EDA will not 
be an actual zero if there is no fault in the system. We call this effect a false alarm. 
Here, we are going to model and quantitatively describe the false alarm effect and 
introduce a threshold device to overcome this problem. 

Cancellation Principle 

Suppose now we have a QRD RLS array of order p — 3. Denote the first and second 
rows of data input as (xj, x 2 , x 3 , Xj + x 2 + x 3 ) and (x'j, x 2 , X 3 , x( +X 2 + X 3 ) respectively, 
where the checksums Xj-1- x 2 +x 3 and x'j + Xj-fXg are inputs to the EDA. Fig. 4.23 shows 
the first two rows of the array. After both data pass through the array, according to 
the operations of the processing cells, the contents of the cells of the first row are 

r n = \Jx\ + x' 2 , 
r — sx 2 + cx 2 , 
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( 


X/'+X^fc' 

Xr+&+X« 


*' 6 *s 

Xi x» x* 



Figure 4.23: The first two rows of the array 
r J3 = sx' 3 + cx 3 , 

r J4 = s(x\ + x'j + x' 3 ) + c(xi + x 2 + x 3 ), (4.134) 

where c = Xi/r n and s = x[/r n are the rotation parameters generated by the 
boundary cell and r t j is the content of PEij. The output of the internal cells are 

Z12 = cx\- sx 2 , 

Z 13 = ^3 — SX 3 , 

z M = c(i'j -f x' 2 + 13) - s(xi + x 2 + x 3 ). (4.135) 

Since sx' + cx x — \jx\ 4- x'\ and cx\ — sx\ = 0, we have r 14 = r n + r 12 + r 13 and 

zu = z J2 + 213. That is, both the contents and the outputs of the first row still meet 

the checksum. The output of the first cell of EDA, Z] 4, can be rewritten as 

z M = c( x' 2 + X3) - s(x 2 + x 3 ). (4.136) 

We can see that the data from the first column got cancelled out by the first cell of 
the EDA. Since the outputs meet the checksum, with the same principle, the data 
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from the second column will got cancelled out by the 2 nd cell of EDA. Thus, this 
observation can be generalized and stated as bellowed: 

Cancellation Princeple: With the checksum encoding data inputted to EDA, the 
data from the i th column got cancelled out by the i th cell of the EDA.D 

For a finite-precision implementation, due to the roundoff error, the data from the 
i th column will not be totally cancelled out by the i th cell of the EDA. This effect 
results in the false alarm problem. 

Floating Point Arithmetics 

A floating point number / can be represented by [30] 

/ = ±.did 2 ■ ■ ■ d t x f3 e , 0 < di < (3, d, ^ 0, L < e < U, 

where fi is the base, t is the precision, and [L,U\ is the exponent range 
point operator fl can be shown to satisfy [30] 

x = fl{x) = x(l + e), 
fl(aopb) = (a op b)( 1 -f e) |e| < u, 

where u is the unit roundoff defined by 

u = for rounded arithmetics, 

and op denote any of the four arithmetic operations +, — , x, -K 

Roundoff Analysis 

For a QRD RLS array of order p with floating point arithmetics, denote the first row 
of the input vector as (£,, x 2 , • • • , x p , £f =1 f,- + e p ), where x, = //(x,), t p = e(^f =1 x,), 


(4.137) 
. The floating 


(4.138) 
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and |e| < u is a constant 2 , and the second row of input vector as (x'j, ij, • ■ • , x p , ££=1 x|+ 
c p ). The content of the first boundary cell is given by 

rii = fl{\jx ? + x'l) = yjxl + x'\{\ + e), (4.139) 

and the rotation parameters are c = /Z(xi/fn) and i = fl{x[/r n ). The contents of 
the internal cells then can be obtained as 


r a = //(/Z(ix') + //(ex,)) 

= [sx'(l + c) + cxj(l + e)](l + e) 

~ ( 1 + 2e)(ix' + cij), 1 < j < p, (4.140) 


and 


n, P+ i = fl{fl{s{j^x[ + e p )) + flic^Tx, + e p ))) 

.=1 

~ ($^x' + c]Tx,) + 6e p . 


«=i 


(4.141) 


»=i «=i 

The mismatch r x resulting from the finite precision computations of the first row is 


Tj = 6e p - (cy xj + x'\ + 2e ]T^(ix' + cx 3 )) 


(4.142) 


i=2 


and it is bounded by 


5: 6p|cx mox | |2ex max | -fi 4(p l)|cx max | 

= (10p - 2)Jex mox | < 10p|ex mox |. (4.143) 

For the second row, using the same approach, the mismatch is bounded by 10(p — 
l)|ex max |. The total mismatch from the whole array for a given input vector is given 

2 To simplify the notation, we do not give indeices to different f’s. 
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by 

p-i 

|t| < 5Z 10 (p - Ol eI ma*| = 5 p{p + l)|cx m01 |. (4.144) 

t=0 

Let n denote the number of input data rows. As pointed out in [109], for the Givens 
rotation method, the roundoff error is proportional to 0(n 15 ). The possible mismatch 
becomes 

|r(n)| < 50(n u5 )p{p + l)|ex mar |. (4.145) 

This bound can be interpreted as: for each row of input, each processing cell con- 
tributes about |ex max | amount of roundoff error. Since there are about p(p + 1) 
processing cells, the total possible roundoff error is then p(p + l)|e£ ma x|- 

In order to prevent false alarm, the threshold th has to be set at least at (r(n)|. 
Suppose 0 = 2,< = 16, then u = 2“ 16 . Given an scaled input data such that 
|x m ar| = 1 and n = 1 0 4 , the threshold of a QRD RLS array of order p = 100 is 

th > |r(n)| mor ~ 5 • 10 6 • 10 4 • 2~ 16 = 10~ 4 . (4.146) 

4.4.3 Overall Memory Size Consideration 

To prevent missing error detection, we want to set the detectable error size A = 
min(6,7/j) as small as possible. While to prevent the false alarm, we also want to 
choose a sufficiently high threshold. Both situations cannot be satisfied simultane- 
ously since both goals are conflicting. 

To detect the error size A, from (4.130), (4.131), and (4.132), we set the threshold 
th < A 2p A. That is, the minimal detectable error size is A -2p • th. From (4.133), 

B±<\-\og 2 th]. (4.147) 
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A criterion to choose B& is then given by 


Ba = min( f— 2p log 2 A - log 2 A] , [- log 2 th ] ). (4.148) 

To prevent overflow, from (4.104), set 

B p = Kp — 1)(1 + log 2 A) + log 2 3f?]. (4.149) 

For a QRD RLS systolic array to detect an error size of A without false alarm and 
overflow problems, the memory size B is required to be at least 

B = ma x(B p , B A ). (4.150) 
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Chapter 5 


Order Degraded Performance and 
Residual Estimations 


When the faulty row is found in the fault-tolerant QRD RLS systolic array, it enters 
into an order degraded operation. In this chapter, the performance degradation of 
the reduced-order QRD LS is studied. The optimal residual estimation under faulty 
situation is also considered in this chapter. 


5.1 Order Degraded Performance 

Consider a order p LS problem with a n xp complex-valued data matrix ,4 p (n) denoted 

by 

A p {n) = [u(l), tx(2), • • • , u(n)] T = (a(l), a(2), • • • , a(p)] = [A v . x {n) : a(p)], (5.151) 

an xl desired response vector 

j/(n) = [d(l),d(2),...,d(n)] T , 
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a p x 1 weight vector 


u/ p (n) = [u>f(n),u;f(n),---,u;£(n)] T = [i «p_i| p (n) : w£(n)] r , (5.152) 

and a n x 1 residual vector 

c p (n) = Ml), e p (2), • • • , e p (n)] T = A p {n)w p (n) - y(n). (5.153) 

Let the index of performance be defined by the weighted / 2 norm of 

U n ) = lkp(^)!lA = l|Ae(n) p || 2 = e p (n)A(n)e p (n), (5.154) 

where 

A(n) = 

with a real-valued forgetting factor 0 < A < 1 and A = A 2 . Then the least-squares 
solution, satisfies 

fp m .„( n ) = min |Up(n)|ft = ||A p (n)tr p (n) - y(n)\\\. (5.155) 

The optimal weight vector can be obtained by solving the normal equation 

<t> P {ri)m P {n) = ^p(n), (5.156) 


where 

<M n ) 


A"_ 1 (n)AA p _ 1 (n) 
A p (n)AA p (n) = 

q£ (n)AA p _i(n) 


<f> P - i(n) : i(n) 

r9 w (n) : T](n) 


A"_ ,(n)Aa p (n) 
<£{n)\a p (n) 

(5.157) 
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and 


<M») = A"(n)Ky_(n). (5.158) 

It can be easily shown that 

= M n )\\l - V’fHMp(n). (5.159) 

For a order p — 1 LS problem, then as before, we want to minimize the weighted / 2 
norm of 

£ P -i(«) = [e p _i(l),e p _i(2),---,e p _i(n)] r = A p _!(n)u> p _i (n) - y(n) (5.160) 
with weihgt vector 

t£ p _i(n) = [to?" 1 (n),u^" 1 (n),---,twJ: 1 l (n)] T . 

Obviously, the optimal weight vector of order p and p — 1 can be obtained as w p (n) = 
4>~ 1 (n)i/’ p (n) and w p ^\(n) = <j> p li{n)xl> p _i(n) respectively. The difference vectoroi the 
optimal residual vectors of different order is defined by 

A(n) = i p (n) — £ p _i(rc) = /4 P -i(n)[ib p _ llp (n) - jk p _,(n)] + A£(n)a(p), (5.161) 

and the weighted / 2 norm of A(n) is defined to be the order degraded performance, 
T(n), given by 

F (n) = \\A(n)\\l 

= &w H (n)<f> p _i(n)Aw(n) + ||u£(n)|| 2 • \\a{p)\\l 

+2 • Re[w p p {n)Am H (n)A^Aa(P)l (5-162) 

where A m H (n) = m p _ }lp (n) - 
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To relate w p _ 1 | p (n) with tb p _i(n), we define a LS problem of order p — 1 with 
j4 p _x as the data matrix and a(p) as the desired response. That is, we would like to 
minimize the weighted / 2 norm of the residual vector 

£a, P -j(n) = A p _ 1 (n)w ap _ 1 (n) — a(p) = [e a , p _i(l), e 0iP _i(2), • • • , c BiP >i(n)] r . (5.163) 

From (5.153) and (5.156), the optimal weight vector can be obtained by solving 




(5.164) 


where 9{n) is defined in (5.157). Let the optimal index of performance of this LS 
problem be £ a , P -i(rc) = ||Lj, P -i( n )llA» ^ en $ P 1 can t> e represented by 


<t> P l i n ) = 


<f> P - iM : Qp-i 


C, 


+ 


1 


€a,p— 1 ( n ) 




1 


[~^ P -i( n ) !]■ ( 5 - 165 ) 


Then w p (n) can be represented by expressions of order p — 1 as 


m P (n) = 4> p '{n)rl> p (n) = 4> p \n) 


AS-xin) 


Q. H {p) 


A y{n) 


4>p-M)A v -M) + — 


-ii(p-i( n ) 

£a,p— 1 ( n ) 


A yin), (5.166) 


and therefore, 


, , . , ^ , A n ) - , x 

Mp_i| P (n) = i£ p _i(n) + 7 — HU,p-i{n), 


<>o, P -l ( n ) 


= 


I(n) 

Up- \( n )' 


(5.167) 

(5.168) 
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where I(n) =< ia, P _i(”), y{n) >\ and < x,y >\=< Ax, A y > is the weighted inner 
product. Thus, 

Al(n) = T^pr ie,, p-i(n). (5.169) 

Now, we may proceed to calculate T(n) in (5.162). The first term of (5.162) becomes 


Au; H (n)^p_!(n)Au;(n) 

||I(n' 112 


Op-i(") 

ll^( n )ll 2 

e.Vi(") 


w. 


,p_i(n)A^_i(n)Ai4 p _i(n)ui« iP _ 1 (n) 


(ia. P -i(«) + «(p)) H A(e 0iP _ 1 (n) + a(p)) 


- 72 rT^ a 'P-*( n ) + ’K* 1 ) + 2 ' Re ^ A,p-l( n )'^(?) >a)). 

Sa,p-l( n ) 

The second term of (5.162) becomes 


(5.170) 


KWH’ • Np)II* = JPP 

Ca,p- l(n) 


7/(n). 


(5.171) 


The third term of (5.162) is 


2 • Re[w?(n)Aw H (n)Ap_ 1 Aa(p)] 

= 2-fle[-Ppli !o '' p . 1 (n)/l« 1 (n)Ao(p)] 

Sa,p-U n j 

= -2 ^ - n |^ • Re[{e a ^ p _ l {n) + a{p)) H \a(p)] 

= -2«,(n)-2 iM^e(<A,p- 1 H,,( P )> A) ). (5.172) 

£ a ,p-l( n ) &»,p-l( n ) 

Combining (5.170), (5.171), and (5.172) together, we have 

ll^(«)H 2 II < £a.p-l(n),£(n) >A II 2 


r(n) = 


||2 

11 a 


(5.173) 


4«,p-l(w) Hia,p-l(»») 

Denote the last row of input data matrix u(n) = [it p _i(n), u p (n)], the difference of 
the optimal residual at time n then can be obtained as 


|e p (n) - e p _i(n)|| 2 = ||u T (n)u; p (n) - u p _j («)&„_! (n) 


102 



(5.174) 


ll?(»)ll 2 

£p-i(») 

l|I(n)ll 2 

c.Vi(") 


||«J_i(n)M QtP -i(n) - 
l|ea. P -x(n)|| 2 . 


u p (n)|| 


2 


5.1.1 Geometric Interpretation 


From (5.173), we can see the order degraded performance is indeed the energy of the 
projection of the desired response y(n) onto the subsapce spanned by the optimal 
residual la, p _i(n). As the vector y(n) becomes more orthogonal to £ 0tP _i(n), the 
order degraded performance is also reduced. Denote the column space of A v _i(n) by 
{A p _i(n)}. Then the projection operator Pj , projects vectors onto space {A p _ i ( n )} 
and the orthogonal projection operator P^- —I — Pa p - x projects vectors onto the 
space (Ap_j(n)} which is orthogonal to space {A p _i(n)}. The entire space S spanned 
by y(n) can be represented by 


5 = (V,W) U {/>!„., fl(P» U {/!»), (5.175) 


and all of these subspaces are orthogonal to each other. It is obvious that la iP _i(n) = 
P^ p i a(p). By projecting the desired response y(n) to these subspaces, we obtain 


y( n ) = P^ P _,y(n) + Pu, P -ild n ) + p A P y( n ) 

= §(")+ < L, P - i(n), y(n) > A + £ (n), (5.176) 

Ika.p-lHIlA 

and all of these vectors are also orthogonal to each other. If we drop the vector a(p), 
then the one-dimensional subspace of {P* ,fi(p)} cannot be used to represent y(n). 
Therefore, the components of y_(n) in this subspace is lost and this introduces an error 
vector P^ p _i y( n ) energy 


- 3S ■ rw ' 


(5.177) 
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Figure 5.24: Geometric illustration 

The energy of the last component of this error vector is given by (5.174). Fig 5.24 
illustrates the geometric interpretation discussed above. 


5.2 Residual Estimation in Faulty Situation 

From Section 4.4, we know that if there is a fault occurring in the system, the error 
output at e 0 due to an error 6 generated at an internal cell PE, : is given by 

= “7 II c 'mV, (5.178) 

m=7 

where rj == 6n*=!+i an< ^ the error due to a faulty boundary cell is given by 

e oO\j) = II C ' • II c m • »?j» (5.179) 

(=1 m=j-t-l 

where t] 3 = ( x tn 6 c — \r6 3 )/r' ( with 6 C and <5, representing errors in the numerators 
while r' represents the erroneous content of the denominators of c 3 and s } . Whenever 
e o( n ) ^ 0, we detect there is a fault occurring in the system. The optimal residual 
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e(n) is thus erroneous and is denoted as e 4 (n). In this section, we devise a method 
to estimate the true optimal residual e(n) given the contaminated residual e 6 (n). 


5.2.1 Faulty Internal Cell 

Once the faulty cell PE{ : is identified, then ij in (5.178) can be estimated by 

e o( n ) /c i on\ 

y - T' (5.180) 

i 1 1 m—j 

where 4 s are obtained at the right hand side of the systolic array. If tj is large, then 
through many non-linear operations in the cells, it is essentially impossible to keep 
track of the effect of rj on e s (n). Thus, for analytical tractability, we assume the error 
r) is small in the sense that r]x p <C r. With this assumption, the erroneous cosine can 


be approximated by 


A r 


c j = 


yj^r 2 + (x + T ]) 2 
A r Ar 


y/r ri -f 2rjx r 1 + rjx 


Cj T]X 

rr ~ C, = C, — T7S, 

1 + J r 1 J 1 3 


(5.181) 


and the erroneous sine is given by 


= \A - c ? - - c2 :) + 2 VSjCj 

— Sj + T]SjCj. 


(5.182) 


And since t/x p r, we also have rjsj <Si 1 and r/SjCj <c, < 1 which yield c' < l,s' < 
1, and c 12 + s 12 ~ 1. 

The erroneous cosine and sine generated by the boundary cell PEj : are then sent 
to the right for processing by the internal cells 1 < k < p — j . The output 
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of these cells are erroneous and are given by 


+ * — C j X S i Ar i,i+fc 

= ( Cj — T}Sj)x — (Sj + V s j c j)^ r j,j+k 

— X out j¥] + k ~~ T l S j C j^‘ r j,j+ki 


(5.183) 


where x is the input data and x out and x' out are the uncontaminated and erroneous 
output respectively. Denote the error size produced by cell PEij as With this 
notation, = r 7, and the error size produced by cell PE JJ+ k is 


3j.j+fc = Ssj.i'jSjCjXrjj+k, 1 < k <p- j. 


(5.184) 


Recall that O(r) ~ 0(x/\/l — A 2 ) and s ~ y/\ — A 2 , we have 0($ JiJ+l t) ~ 0(771). As 
discussed above, the contaminated rotation parameters generated by the boundary 
cell PE j+k ,j+k, are given by 


(5.185) 


C 'j+k ~ C i+k ~ ^j+k-l,j+kSj+k 

{ s' J+ k = s J+k + % : +k-\j+kSj+kC 3+ k 1 < k<p-j. 

By solving this linear system of equations, we can estimate the uncontaminated sine 
and cosine as 


c :+k — 2^ C ] + k ^]+k-i,j+k "h \j ( C j+k j+k-l,j+k) 2 T 4Sj+*, 


S 3 + k 


- v^=" 


(5.186) 


'J+k 


0 < k < p — j. 


When ^j+k-ij+k ^ we can further simply (5.186) as 


C J + k — c j+k "h S : + k- 


(5.187) 


To obtain J+( , l<k<p — j,k<l<p — j, from the operations of the internal 
cell 


ou 6+*,j+i 
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— ( C j+fe — < ^‘j+k-l,j+kSj+k){ x "h ^ j+k-l,j+l ) — ( s j+k + ^j+*-lj+fc 5 i+fe C J+fc) ■ ^ r j+k,j+t 
= x out J+ k,,+l ~ ( S j+k'^j+k-l,j+k — c j+k)'^j+k-l,j+l ~ '■fj+k-l,j+kSj+kCi+k^ r j+kS$d^) 

A recursive equation is then obtained to compute Qj+kj+t from ^j + k-\,j+i an d ^tj+k- i,j+k 
as 

'Sj+kj+l = Pj+k'Sj+k-l.j+l + 3j+k-l,j+kSj+kCj+k^rj+k,j+l, 

l<k<p — j,k<l<p — j, (5.189) 

where p j+k = Sj +k < 3 j+k _ uj+ k - c J+k . The order of %j+ k ,j+i is that of 

0('Sj+k-l,j+kSj+k C ]+k^ r j+k,]+l)- 

Therefore, 0(5 J+lj+ ;) ~ 0(t]x 2 ). We can also show that 0(^s :+kt]+ i) ~ 0(rjX k ) < 
0(t]x p ). Thus the error size is consistent with the original assumption of the analy- 
izable situation. 

With (5.189), we can recursively compute the error size produced by the cell 
PE PiP+ 1 and that of 3 p . p+] . The erroneous residual can be obtained as 

e 6 (n) = e(n) - 75 PtP+1 . (5.190) 

The optimal uncontaminated residual is then estimated as 

e(n) = e 5 (n) + 7 G p<p+1 . (5.191) 

To update the content of PE J+k j+k, we try to keep it as uncontaminated as possible 
by performing 

r j+k, ]+ k{n + \) = {\ 2 r 2 J+kj+k {n) + {x + $ 3+k - lJ+k ))*, 
r :+k,]+l{ n T 1) = Sj+ k (x + ^ji+fc-lj+j) c j+k^ r j+k,j+l{ n )i 

\<k<p — j,k<l<p — j. (5.192) 
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We summarize the recursive estimation procedure as follows, 


1. Estimate error size i ? = — ^o( n )/(7 FIm=j c ' m ) fr° m the EDA. 

2. Estimate the uncontaminated rotation parameters by using 

- iWj+k - ^Jlk-ij+k + \/Wj+k ~ Zjlk-i, j+ k ) 2 + 

, s j+k - 7 1 “ c ]+k 0 < k<p-j. 

3. Recursive estimation of the error size by computing 

'Sj+kJ+l — Pj + k < 3j + k-l,j+l+ < $‘j+k-l,]+kS] + kC]+k^ r j+k,j+h 1 5: ^ < P~ji ^ ^ £ P — J- 

4. Estimate the output residual from 

e(n) = e l5 (n) + 7^ P , P +i- 

5. Update the contents of the processing cells by doing 

r 3 +k J+ k(n + 1) = (A 2 r 2 :+kj+k {n) + (x + 9j+fc_, j+k))^ 
r j+k,j+l{ n d" 1 ) = Sj+k{ x d" 'Sj + k— 1J+/) — c j+k^ r j+k,j+l{ n )i 

1 < k < p — j, k < l < p — j. 


5.2.2 Faulty Boundary Cell 


For a faulty boundary cell, from (5.179), we can estimate 

* - n» c ‘°n f -? ■ (5 - 193 > 

11/=1 W 1 lm=j + l Sn 

Unfortunately, there is no way to estimate 6 C and <5 3 from rjj. Therefore, we are unable 
to estimate the optimal residual if the faulty cell is a boundary cell. 
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Chapter 6 


Multi-phase Systolic Algorithms 
for Spectral Decomposition 


In this chapter, we propose two multi-phase systolic algorithms to solve the spectral 
decomposition problem based on QR algorithm. The spectral decomposition consti- 
tutes one of the most computationally intensive needs of modern signal processing 
applications. While the QR algorithm is well known to be an effective method to 
solve the eigenvalue problem, there is still no single systolic array architecture that 
can compute the unitary Q matrix readily and perform the QR algorithm efficiently. 
Previous methods based on the Jacobi-like approach required global communication 
or broadcast in computing the eigenvector, and methods using the QR algorithm 
had communication problems among different architectures. In this chapter, we show 
that the Q matrix can be computed easily by using multi-phase systolic algorithms 
and thus the eigenvectors can also be computed without any global communication 
in the array. Two arrays, a triangular and rectangular, are presented to implement 
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the multi-phase algorithms. Details on these multi-phase operations of the QR al- 
gorithm as well as architectural consequences are discussed in the chapter. Efficient 
fault- tolerant schemes for these multi-phase operations are also considered. 

Recent developments in the parallel processing architectures, especially the sys- 
tolic architectures, for the spectral decomposition are discussed in section 6.1. In 
section 6.2, some preliminary matrix operations useful for the multi-phase operations 
are discussed. In section 6.3, we review the QR algorithm and show the evalua- 
tion of the eigenvector from cumulative multiplications of the Q matrices. Then two 
multi-phase systolic arrays for the QR algorithm and the Hessenberg reduction are 
presented in section 6.4. Their performances, numerical stabilities, and convergence 
rates are studied in section 6.5. Finally, some efficient fault- tolerant schemes that can 
be incorporated with the arrays are discussed in section 6.6. 


6.1 Recent Developments 

Computing the spectral decomposition of a matrix is an important issue in many mod- 
ern signal processing and system applications. The feasibility of real-time processing 
for sophisticated modern signal processing systems, depends crucially on efficient im- 
plementation of parallel processing of the algorithms and associated architectures 
needed to perform these operations [14, 58]. While many variations exist in the lit- 
eratures for solving these matrix problems, the heart of all these iterative methods 
are based either on the Jacobi-Hestennes method or the QR algorithm [30, 101, 107]. 
While there are some fundamental differences between these two approaches, both al- 
gorithms have good numerical stability and convergence rate properties and thus are 
desirable for possible implementation. Since present VLSI technology is capable of 
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building a multiprocessor system on a chip, many researchers have proposed different 
parallel processing architectures to solve eigenvalue and singular value decomposition 
(SVD) problems. 

For any complex-valued m x n matrix A, the classical spectral decomposition [102] 
of the n x n Hermetian matrix A H A, is given by 

A H A = £ A.u.uf = V\V H , (6.194) 

«=i 

where V = [v x , ■ • • ,u n ] is an n x n unitary matrix, A = diag[\i, ■ ■ ■ , A„], and H is 
the complex conjugate transpose operator. The A [s are the eigenvalues satisfying 
Ai > A 2 • • • > A n > 0 and the u's are the eigenvectors satisfying A H Av x = A,-u,-. The 
decomposition of A H A follows from the SVD [30] of A given by 

A = USV H , (6.195) 

where U = [uj, •••,«„] is an m x n matrix with orthogonal column vectors, S = 
diag[si, • • • , s n ], and V is an n x n unitary matrix. The s's are the singular values 
satisfying Si > S 2 - > s n > 0 and are the positive square roots of A(s such that 
A = S' 2 . In this paper, we shall use spectral decomposition in the broad sense of 
not only including the decompositions of (1) and (2), but also include the eigenvalue 
decomposition of an arbitrary complex- valued n x nmatrix A given by 

AX = AX, (6.196) 

where X is an n x n matrix of eigenvectors and A = diag[ Aj, ■ • • , A„] is the matrix of 
eigenvalues of A. 

Luk [72], Brent [12], and Gao and Thomas [27], have used effectively the Jacobi- 
like method to solve these problems for either a multiprocessor system or systolic 
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array. The basic problem concerns the diagonalization of a 2 x 2 matrix by the 
rotation matrices J(6) and K{(f>) in 


J(0 ? 


w x 


y z 


K(4>) = 


d x 0 

0 dj 


(6.197) 


A two-stage procedure is then used to find 0 and <f> [72]. To find the SVD of a square 
matrix A, an appropriate sequence of 2 x 2 matrices is computed by using the basic 
Jacobi transformation in 


Tn : A «- J/j AKij 


(6.198) 


where J xj and K t] are rotations in the [i,j) plane chosen to annihilate the ( i,j ) and 
(j,t) elements of A respectively [72]. While the Jacobi-like method, as considered in 
[72], is currently known as one of the most effective parallel SVD algorithm for full 
dense matrices, the computations required to obtain the rotational matrices needed 
in this approach to obtain the singular vectors are not simple and can not be obtained 
without broadcast [72]. Moreno and Lang [82] also considered some alternatives to 
the algorithms in [12]. 

On the other hand, other researchers have used QR algorithm to solve the eigen- 
value problems. Heller and Ipsen [33, 43] performed the QR iteration for banded ma- 
trix based on orthogonal systolic network and Schreiber [94] combined their network 
with Gentleman-Kung's QR array to compute the QR algorithm. These methods 
required the computation of the unitary matrix Q. However, problems exist in the 
concurrent computation of Q and the pipeline operation of the QR iteration [43]. In 
[81], Moldovan et al. studied the mapping of a large QR algorithm onto a fixed size 
array. Torralba and Navarro [103] further purposed a size-independent linear array for 
QR iteration and Hessenberg reduction. While this approach can provide an efficient 
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computation of one iteration of the QR iteration, it is not obvious how to pipeline 
the iteration. 

For some system applications, such as matrix rank determination and system iden- 
tification [52], the efficient computation of singular values is sufficient, while in other 
applications such as antenna beamformation [79, 97], spectral estimation [50, 93], 
direct finding [31, 85], etc., the eigenvectors are crucially needed. This makes practi- 
cal implementation of systolic arrays discussed above difficult for many applications 
since they either cannot compute the eigenvector or cannot obtain the eigenvector 
without broadcast. For example, for the MUSIC algorithm [32], once we determine 
the signal subspace and noise subspace from the eigenvalues, the sample spectrum is 
then determined by 

5 ^ = s H {u)x N x%s{ijy 

where X N is the matrix of eigenvectors which generate the noise subspace and 

with K is the dimension of matrix Xjv. A system which consists of several systolic 
modules to compute the MUSIC algorithm has been proposed in [90], However, com- 
munication problems among the modules and the difficulty of matching the pipeline 
rates and timings among different modules may pose difficulties for practical imple- 
mentation. 

Presently, there is no known simple efficient systolic array approach for the gener- 
ation of eigenvectors. The main reason is that there is no single architecture that is 
capable of handling all the steps required in the algorithm such that we can pipeline 
the successive iteration readily. The communication cost among different architec- 
tures is high and the interface problem for an efficient data flow is demanding. In this 
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paper, we propose two multi-phase systolic algorithms to solve the spectral decompo- 
sition problem based on the QR algorithm. By multi-phase operations we mean that 
the processing cells can perform different arithmetical operations in different phase 
of the computations. Two systolic arrays, one is triangular and the other is rectan- 
gular, are designed based on the multi-phase concept. A key feature in our method 
for the successfully application of the QR algorithm is that the Q matrix of the 
QR decomposition can be computed explicitly by multiphase operations. With the 
proper feedback of this Q matrix, the QR algorithm can be computed and pipelined 
effectively in a single systolic array. From the accumulation of those Q matrices in 
another array, eigenvectors and singular vectors can be computed without needing 
global communication inside the array. 


6.2 Systolic Array Matrix Processing 

In this section, we consider some preliminary matrix and associated systolic array 
operations needed in the multi-phase systolic algorithms for spectral decompositions. 
A. QR Decomposition 

A non-degenerate m xn rectangular matrix A can be factorized into two matrices 
Q and R such that A = QR , where Q is an m x m unitary matrix and R is an m x n 
upper triangular matrix. The matrix Q can be computed using sequences of Givens 
rotations. An elementary Givens transformation has the form of 


c s 


0 • 

• 0 r, r l+1 

. . . r k 


0 • 

• 0 

r 'i r ,'+l • 

•• r 'k 

— s c 


0 • 

■ 0 x, x 1+1 

■■■ x k 


0 • 

■ 0 

0 x' +1 • 

“ X ' k . 


( 6 . 199 ) 
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Figure 6.25: Triangular systolic array for QR decomposition. 


where 


'rf + xf 


<rj + x? 


Several different QR arrays have been considered by Gentleman and Rung [26], Heller 
and Ipsen [31], and Luk [67]. In particular, the computation of the Q matrix without 
broadcast is difficult for the array considered in [66, pp.266]. On the other hand, [26] 
has shown that a triangular systolic array can be used to obtain the upper triangular 
matrix R based on sequences of Givens rotations. This approach also leads to an 
efficient method for performing recursive least-squares computation [72], and is also 
useful for finding the singular value of a matrix [25]. This systolic array is shown in 
Fig. 6.25 and the operations of the cells are described in the first column (i.e., phase 
1) of Table 6.5. While the rotation parameters are propagated to the right, the Q 
matrix will not appear directly at the right as originally suggested by [86]. In order to 
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Table 6.5: Operations of the processing cells for different phases. 


demonstrate this point, denote as the Givens rotation matrix of the (i,j) plane, 
then matrix Q can be obtained as 

1 m 

Q T = LI II G <v (6.200) 

» = T7l — 1 j = i + l 

where LI is an ordered matrix product such that U.=m-i = C m _iC m _2 • ■ • C\, while 
n denoted a conventional ordinary product. From Table 6.6, we can see, for a n xn QR 
triarray, the first rotation parameter coming out at the right edge occurs at time n + 1. 
After that, rotation parameters for different plane rotation come out successively. If 
assuming that the operation of fl discussed above can be obtained immediately, then 
there are m operations of LI when all of fl are available. This observation leads to 
the conclusion that we cannot obtain the Q matrix by cumulatively multiplying the 
rotation parameters propagated to the right edge. Thus, This is not an effective way 
to obtain the Q matrix. 
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Time 

n+1 

n+2 

n+3 

n+4 

n+5 

n+6 

First 

row 

(1.2) 

(1.3) 

(1.4) 

(1.5) 

(1.6) 

(1.7) 

Second 

row 



(2,3) 

(2,4) 

(2,5) 

(2,6) 

Third 

row 





(3,4) 

(3,5) 


Table 6.6: The timing table for the rotation parameters to reach the right edge of the 
QR triarray. 

B. Computation of R~ T x 

In [7], Comon and Robert presented a rectangular systolic array for the computa- 
tion of B“M, where B and A are square and rectangular matrices respectively. The 
computation takes two phases. First, the matrix B is fed into the array and B -1 
is computed. In the second phase, the matrix A is inputted to produce B~ X A. For 
the special case where B is an upper triangular matrix denoted by R, instead of a 
full dense matrix, McWhirter and Shepherd [73] used the property that a triangular 
array can compute R~ T x in one phase with the matrix R prestored in the triarray. 
Since this property is needed in phase 2 of our work, and no derivation was given in 
[73], we present a brief derivation of this result. Define r,y = {R)ij and r[- = (B -1 ) tJ , 
where r tJ = 0 and r[- = 0 for i > then it can be shown that 

, i/ r «. * = 

r b = \ 

- Ei=, r' ik r kj /r :: , i < j < n. 
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Let [t/i, • • • , y„] T = R T x, then 

V: = 22 *« T y > J = 1, • • • , n. (6.202) 

1=1 

In particular, y* can be expressed in terms of r tJ as 

j j-i j-i 

w = -■{*]- 22 x < Ewj) 

r ii i=i Jfc=i 

= — (6.203) 

r ii fc=i i=i 

From (6.202), yj is given by 

1 } ~ x 

Vj = — (xj - £ ykr kj ) (6.204) 

k= i 

(!) " ^ 

Thus, y, can be computed recursively according to the above equation in the following 
algorithm: Fig. 6.26 shows the data flow of the input x and the output y. 
Recursive Algorithm for Computing y = R~ T x 

V' = £ • 

for j = 2 to n 
begin 

z : = x : 

for k = 1 to j — 1 
Z j = Z ] — Vk r k] 
yj = ~j/ r n 

end 

The corresponding systolic array to implement the above algorithm is the same 
as the one shown in Fig. 6.25. The operations of the cells are shown in the second 
column of Table 6.5. The first part of the (6.204) , (i.e., the division), is performed 
by the boundary cell while the second part of (6.204) is cumulated by the internal 
cells. With R pre-stored in the triarray, Fig. 6.26 shows the data flow of the input x 
and the output y. 
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Figure 6.26: Computation of R T x using a triarray. 

C. Triangular-Matrix Multiplication 

The multiplication of a triangular matrix R and an rectangular full dense matrix 
B is given by 

(C) i: = (RB) i: = j2r ik b kj , (6.205) 

k=i 

where r tk and b kj are elements of matrices R and B. Using the same array as in 
Fig. 6.25, with R prestored in the triarray and the operations shown in the third 
column of Table 6.5, this multiplication can be easily obtained if B is inputted row 
by row as in Fig. 6.27. 

D. Matrix Multiplication 

There are many ways to implement a full matrix-matrix multiplication in systolic 
array [52]. In Fig. 6.28, we show a typical architecture that can be incorporated with 
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Figure 6.27: Multiplication of a triangular matrix and a full dense matrix. 

the multi-phase operations to obtain eigenvectors. With input matrices Q and A 
arranged as in Fig. 6.28, the matrix B T , where B = AQ, will sit in the rectangular 
array when the computation is completed. Details on this issue will be discussed in 
later sections. 


6.3 QR Algorithm 

In this section we review briefly the basic operation of the QR algorithm and show 
the evaluation of the eigenvectors from the cumulative multiplication of successive 
Q matrices. For a complex-valued n x n matrix A , it states that there is a unitary 
transform U such that R = U AU H is a upper triangular matrix with diagonal eigen- 
values of descending order. This follows from the QR Algorithm [27, 92, 97] where 
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Figure 6.28: Matrix-matrix multiplication in a rectangular array. 

by setting A\ = A, we have A k = Q k Rk and A k+X = R k Q k = Q k A k Q k , k = 1, • • •, 
with unitary Q k and upper triangular R k . Furthermore, A k converges to the upper 
triangular matrix with diagonal eigenvalue elements. However, it is not obvious how 
to compute the eigenvectors from those Q k and R k we have calculated. With similar 
derivations as in [97], here we shows how to obtain the eigenvector associated with the 
largest eigenvalue from cumulative multiplications of Q k . From the above discussions, 
we have 

"djt+i = Q k Q k - 1 • • • Qi ^\Q-[Q2 ■ • • Qk- (6.206) 


Define 

k 

Qk = 1_I Q, = Q 1 Q 2 • • • Qk , 

1=1 
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(6.207) 


Rk = IJ Ri = RkRk-i ■ ■ ■ R\- 


i~k 


Then we have 


QkAk+i = AiQk- 


(6.208) 


Thus the multiplication of Q k R k can be expressed as 

QkR-k = Q\Qi • • • QkRkRk-i • • • Ri 

= Qk-\A k R k -\ = AxQk-\Rk-i — Ax = A k . 


(6.209) 


Let the eigenvalues of A satisfy, |A t | > |A 2 | > ••• > |A n |. Denote the matrix 
eigenvectors and eigenvalues of A by X and A respectively. Then A k is given by 


A k =XA k X~\ 


( 6 . 210 ) 


Let the QR decomposition of A' be A' = QR and the LU decomposition of X~ l be 
X~ x = LU , where L is an unit-lower triangular matrix. Then 

A k = QRA k LU = QR(A k LA~ k )A k U, (6.211) 


where 

A k LA~ k = / + £ fc , (6.212) 

and 

{E k ), : = 

I 0, otherwise. 

Therefore, when k is large enough, we have lim*.^,*, E k = 0 and thus A k LA~ k ap- 
proachs the identity matrix. Then (6.211) can be rewritten as 



A k - QRA k U. 


(6.214) 
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Since the term RA k U is an upper triangular matrix, comparing to (6.209) we can 
see that Qk — ► Q when k is large. That is, the Q matrix of the QR decomposition 
of A k approachs to that of the Q matrix of the QR decomposition of the matrix of 
eigenvector A'. Define 


Qk = 



A = 

[Zl)£2 1 • ' • tin]: 

(6.215) 


and r,j as the (i, j) element of R. From Qk —* A, we find 7*ng — ► Xj when k is 
large. Since X! is the eigenvector associated with the largest eigenvalue, we conclude 
that the first column of the matrix Qk approach the eigenvector associated with the 
largest eigenvalue of matrix A when k is large. If the matrix A is symmetric, which 
is often the case for many signal processing applications, the similar transformation 
Ak + 1 = Qk AQk is also symmetric. Since Ak+i approaches the upper triangular matrix 
by the QR algorithm, Ak+i approaches a diagonal matrix. That is 

A k — A, (6.216) 

and 

Q k A -> A. (6.217) 

In this case, for large k, the columns of Qk become proportional to the columns of 
eigenvector in A'. 

If A is real, then Ak will converge to a real block upper triangular matrix with 
lxl and 2x2 main diagonal blocks. The complex conjugate pairs of eigenvectors 
of the 2x2 blocks can be solved easily using the quadratic formula. When A is not 
a square matrix, the singular values and vectors are of interest. For a m x n matrix 
B , where m > n, the SVD of B shows B = UY1V T , where U is a m x n matrix of 
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Figure 6.29: A circular multiplexer. 


orthogonal columns, V is a n x n unitary matrix, and £ is a n x n diagonal matrix 
with diagonal elements called singular values given in descending order. For most 
situations where high condition numbers are not encountered, a simple symmetric 
n x n matrix C = B T B can be formed and the matrix V can be found by direct use 
of the QR algorithm. Similarly, U can be found by using D = BB T . 

6.4 Multi-phase Systolic Algorithms 

In this section, we introduced the multi-phase systolic algorithms to compute the QR 
algorithm. Two arrays, triangular and rectangular, can be used to compute the QR 
algorithm with some advantages and disadvantages for each. We shall show that our 
methods compute the Q matrix explicitly without requiring any global communication 
within the array. Before we consider the multi-phase algorithms, two communication 
switches are first discussed. A circular multiplexer is a device which takes its inputs 
and distributes them in different output positions as shown in Fig. 6.29. We use a 
skewed row to represent the circular multiplexier. A first in/first out (FIFO) buffer 
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Input 


Figure 6.30: A first in/first out buffer. 


is a buffer which takes its input to output in a first come first out manner as shown in 
Fig. 6.30. Both devices are commonly used in computer and microprocessor systems 
for data arrangement [36]. The computation of a QR algorithm consists of two basic 
steps. Initially, set A\ = A. 


(1) for k = 1, 2, • • • , compute Ak = QkRk\ 


(2) compute Ak+ 1 = RkQk, stop if converge, otherwise go back to step (1). 


6.4.1 Multi-phase Triangular Systolic Array 

The QR Decomposition triarray proposed by Gentleman and Kung [26] is used in our 
approach. The R matrix is stored in the triarray after the computation. To compute 
the matrix Ak + 1 in step (2), the Qk matrix has to be computed first. Let us call the 
computations in step (1) and step (2) an iteration. Several iterations are required for 
Ak to converge. For each iteration, we propose a three phase operation on a triarray 
as follows: 
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• Phase 1: QR decomposition for Ak 

Compute the QR decomposition of the matrix Ak = QkRk, with the upper 
triangular matrix Rk being stored in the triarray [29]. The data in Ak is inputted 
row by row skewed in time as shown in Fig. 6.31. 

• Phase 2: Computing the Qk matrix 

From the QR Decomposition, we have R^ T Aj. = . Let the i th column of 

matrices A J and Ql be denoted by a, and q. respectively. Then 

RZ T h\,Zi ,- •*,«„] = [£i ’£ 2 ’“ •’£„]• ( 6 . 218 ) 

Section 6.2 showed that R^ T x can be computed in a triarray same as the one 
used in Phase 1. Since the i th column of A[ is the i th row of Ak, then with Ak 
inputted row by row skewed in time as shown in Fig. 6.32, the operations of 
the processing cells are given in the second column of Table 6.5. The triarray 
computes the Qk matrix of Ak . The matrix Qk is then outputted row by row 
as shown in Fig. 6.32. In order to start Phase 3, the matrix Qk has to be in 
the form of Fig. 6.33. Observe that the output Qk of phase 2 shares the same 
snap-shot order as the desired arrangement of Qk in Phase 3 after a transpose 
operation. A circular multiplexer is used to distribute each column output of 
Qk into row input as indicated in Fig. 6.32. 

• Phase 3: Computing RkQk 

With the operations of the processing cell as shown in the third column of 
Table 6.5 and the Q k obtained in Phase 2, Fig. 6.33 shows the computation 
of Afc +1 = RkQk in the triarray. Then the matrix Ak+i comes out column 
by column from the right side of the triarray. Again, we observe that Ak+\ 
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Figure 6.31: Phase 1: The QR decomposition. 
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Figure 6.32: Phase 2: Computing the Q matrix. 
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Figure 6.33: Phase 3: Computing the matrix product RQ. 
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shares the same snap-shot order as the desired arrangement of Ak in Phase 1 
after a transpose operation. If not convergent, a new iteration is repeated by 
feeding back A*+i into the triarray after using a circular multiplexer as shown 
in Fig. 6.33. Then Phase 1 operation begins as in Fig. 6.31. 

An attractive property of this multi-phase operation is that the feedback require- 
ment of the matrices in different phases are identical. Thus, only a circular multi- 
plexer is needed for each row outside the array. Observe that each column of the 
matrices inputted in all of the phases need n time steps to process and the next phase 
can be started at time n + 1. We find once the data outputted at the right hand 
side of the triarray, after passing through the circular multiplexer, it can be piped 
into the array for the next phase computation without suffering any delay. If we as- 
sume the multiplexer is ideal such the delay in the multiplexer can be ignored, it takes 
3n-f(2n — 1) = 5n — 1 system clocks for one iteration. The (2n— 1 ) term represents the 
initial time to feed the data into the array. Suppose the number of iteration required 
for convergence is 5, then the total number of system clocks needed is 35n-|-(2n— 1). 
Thus, the converge rate of this algorithm is of the order of 0((3S + 2)n). After the 
convergence of the Ak matrix, those values on the boundary cell are the eigenvalues 
of the A matrix. 


6.4.2 Multi-phase Rectangular Systolic Array 

The above method requires the use of the R~ T operation in the computations. From 
a numerical stability point of view, we may want to consider an alternative that uses 
a square matrix for cumulative multiplication of the rotation parameters. Fig. 6.34 
shows a square matrix which is an extended version of the Gentleman-Kung’s trian- 
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gular array with two delay elements (represented by black dots in Fig. 6.34) in the 
vertical communication links of the lower triangular part of the array. The processors 
in the lower triangular part are identical to the internal cells in the upper triangular 
part. Denote 

[A/ /I] = • ■ • <^;&] (6.219) 

as the parallel combination of matrix A and /, where A = [a l5 a 2 , • • • , a n ] and I is 
a n x n identity matrix with e, as its i th column. The square array takes the input 
[A/ /I], while rotating A into an upper triangular matrix, / is used to cumulate the 
rotation parameters by 

QIA//I] = [R//Q], (6.220) 

Noted that processors in the upper triangular part not only rotate the matrix A but 
also cumulatively multiply the rotation parameters with I. Thus, its work load is, 
in general, twice as that of Gentleman-Rung’s internal cell. Processors in the lower 
part, on the other hand, only cumulate Q from the propagated rotation parameters. 
A two phases operation for QR iteration is proposed as follows: 

• Phase 1: QR decomposition 

Compute the QR decomposition of matrix Ak = QkRk', both Qk and Rk are 
obtained and stored. Then each row of Qk is piped out and fed back to the 
array through a FIFO buffer as shown in Fig. 6.34. 

• Phase 2: Computing RkQk 

In this phase, the operation is identical to that of the Phase 3 in the triangular 
array. A circular multiplexer is used to transform A/t+i from row output into 
column input. Continue this iteration until converged. 
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Due to the delay elements at the lower triangular part of the array and the work 
load of each processor (except those in lower triangular part) is twice as that of 
triangular array, the time to obtain the i th row of the Qk matrix, t,, is 

ti = max(2(n + 3 i — 3), 2(2n + i — 2)), 

where 2(n + 3i — 3) is the time for the left-most cell of the i th row to obtain its Q 
element and 2(2 n -f i — 2) is the time for the right-most cell to finish. Obviously, 
when i > t{ = 2(n + 3? — 3). Thus, the time required to obtain the whole Q 

matrix is t n = 8 n — 6. By assuming that it takes time n to sequentially pipe out the 
Qk matrix, this algorithm takes (9n - 6) + n + (2n — 1) to complete an iteration in 
the worst cast. Again, denote the number of iteration as S, this algorithm converge 
in the order of 0(5(1071 — 6) + 2n — 1) = O((10S + 2)n). Of course, the performance 
can be improved by piping out each row of Q matrix when it is available instead of 
waiting for the whole Q matrix is available. With this, the performance can reach to 
the order of 0((9S + 2)n). 

6.4.3 The Hessenberg Reduction 

In order to perform the QR algorithm efficiently in conventional Von Neumann type 
series computers, we usually transform the data matrix A into an upper Hessenberg 
matrix before the QR iteration is started. With this transformation, the amount of 
work per iteration is reduced from 0(n 3 ) to 0(n 2 ) [30]. However, this argument may 
not be true for parallel processing architectures. The reasons are two folds: 

1. Due to the hardware resource in a parallel processing architecture, the compu- 
tations can be performed concurrently without hindering the processing time. 
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For example, the computation times of the two multi-phase arrays discussed 
above are of the order 0(n). 

2. The data matrix is usually not of the Hessenberg form. The pre-processing of 
the data matrix to Hessenberg form may not be able to be incorporated with the 
following computations. That is, the pre-processing must be done separately. 

Both reasons may lead to the conclusion that the Hessenberg form is not of inter- 
est and practical for parallel processing of the QR algorithm unless the Hessenberg 
form can be obtained easily by using the same parallel processing architecture. Many 
papers prevented to answer this question by assuming the Hessenberg form (or the 
tridiagonal form) is already available. Fortunately, the Hessenberg form can be ob- 
tained easily as an additional part of the multi-phase operations. 

To obtain the Hessenberg form, we can choose an unitary similarity transformation 
U such that A\ = U H AU is a upper Hessenberg matrix [30]. The transformation U 
can be obtained from sequences of Givens rotations. Denote G , as the product of 
the Givens rotation matrices which zero out the proper positions of the i th column. 
Since the first i th rows will not be affected by G t , the matrix G , is of the form G, = 
diag(I t ,Gi), where /, is an identity matrix of dimension i. Suppose the Hessenberg 
form through its first k — 1 columns have been obtained 

#11 B\ 2 #13 

(G l --G k . l ) H A(G l ---G k . i ) = B 2l B 22 # 23 , (6-221) 

0 #32 #33 
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where B n and B 33 are (k — 1) x (k — 1) and (n — k) x (n — k) matrices respectively. 
Then 


B\2 

B l3 G k 

B 22 

BizGk 

G%B 3 2 

G" B^Gk 


(G i ---G k ) H A(G 1 ---G k ) = 

is a Hessenberg form through its first k columns. Thus 

A 1 = (G l ---G n _ l fA(G 1 ---G n . l ) 

is an upper Hessenberg matrix and 


(6.222) 


(6.223) 


U = (?!••■ G„_i = 


1 0 T 

Q G 


(6.224) 


Denote 


,4 = 


T 

a 


1 

o T 


T 

a 


= UA = 




A 


0 

G 


R 


(6.225) 


where A = GR and R is of the form of an upper triangular matrix without the 
lowest vertex element. Then A t = AU = U H AU. Obviously, this is similar to the 
computations in the QR iteration. To obtain G and R , let 

A 


The QRD of A is 


A = 


A-QR- 


0,0,-.., 1 


' G Q t ' 


R 1 

. Q 1 


1 

O 
C D 

K— * 


(6.226) 


(6.227) 


Now, we can use the multi-phase operations to obtain the upper Hessenberg form. 
We call this the Phase 0 operation. 


Phase 0: The Hessenberg Reduction. 
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(1) Phase 1 Operation: QRD of A. 
from A = QR , we obtain R in the triarray. 

(2) Phase 2 Operation: Computing G matrix. 

From Q = R~ T A T , we obtain matrix G. 

(3) Phase 3 Operation: Computing the Hessenberg matrix A k . 

By forming 

T 
a 

A = 

R 

and 

' i Q T 
U = 

0 G 

we obtain the Hessenberg matrix A\ = AU. 

6.4.4 Computing the Eigenvectors 

To compute an eigenvector, a matrix multiplication systolic array can be incorporated 
with the multi-phase array such that those matrices Qi, ■ ■ ■ ,Qk are cumulated to form 
the Q ) t matrix. Noted that Q k = Q k -iQk and the matrix Qfc-i is available at the start 
of the k th iteration, while the matrix Q k coming out at Phase 2 operation of the k th 
iteration. Then Qk is obtained by multiplying Q k ~i and Q k as shown in Fig. 6.28. A 
system configuration for triangular array is shown in Fig. 6.35 and that for rectangular 
is shown in Fig. 6.36. As discussed in section 6.3, for a symmetric A matrix, when 
A k converged, Q k yields the matrix of eigenvectors. For a non-symmetric A matrix, 
the first column of Q k yields the eigenvector associated with the largest eigenvalue. 
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Figure 6.35: System configuration of the multi-phase triarray. 



Figure 6.36: System configuration of the multi-phase rectangular array. 
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Triangular 

array 

Rectangular 

array 

Computation 

time 

0((3S+2)n) 

O((10S+2)n) 
worst case 

Numerical 

stability 

fair 

stable 

Number of 
cells 

nfn+l'l 

2 

E? plus 

(n 2 -n) d-elements 

I/O ports 

2n 

3n 

Utilization 

1 

<1 

Communication 

devices 

1 

2 


Table 6.7: Comparisons of the multi-phase triarray and rectangular array. 

6.5 Performance Efficiency 

6.5.1 Comparisons of the arrays 

Although there are three phases of operations, the arithematical operations in Phase 
2 and Phase 3 form a subset of the operations executed in Phase 1. Therefore we do 
not increase the cell complexity in the multi-phase arrays. The performance and char- 
acteristics of both triangular and rectangular arrays considered above are summarized 
in Table 6.7. The advantages of the triangular array are: it has less computational 
time as well as less number of cells, I/O ports, and communication devices. Further- 
more, all the the processing cells are fully utilized. However, due to the computation 
of R~ t in Phase 2 of the operations, it may be numerical unstable for certain highly 
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ill-conditioning data. For example, consider the matrix given by 

0.7601 -0.3967 0.6060 

-0.3967 1.7475 -0.1962 

0.6060 -0.1962 0.4924 

with eigenvalues {2.0, 1.0, 10 -12 }. If the triarray algorithm, which uses R~ T to obtain 
Q T , is used, the eigenvalues are obtained as {2.0,1.0,3.6818 • 10 -6 }. On the other 
hand, eigenvalues are obtained as {2.0, 1.0,9.9999 • 10 -13 } if the R~ T is not explicitly 
computed. As a result, we have a complexity versus numerical stability tradeoff for 
the two multi-phase arrays. 

6.5.2 Rate of Convergence 

Similar to Luk in [72], by convergence we mean that the parameter of f(Ak) defined 
as 

®//(40 = S ' <, jv °' J(tl . (6-228) 

where N is the number of off-diagonal elements, has fallen below some prechosen 
tolerance value. As indicated in [72], it is difficult to monitor of f(Ak) in the parallel 
computation. Luk then proposed that the iteration be stopped after a sufficiently 
large number S of iterations. In the studies of Brent and Luk [12, 72], they found 
that S < 9 for random symmetric matrices of order n < 230 and S < 6 for n < 24. 
Therefore, they chosen S = 10 for n < 100 for Jacobi-like method. Similar to their 
approach, we apply the QR algorithm to random nxn symmetric matrices (a, ; ), where 
the elements a tJ for 1 < i < j < n were uniformly and independently distributed in 
[—1,1]. The tolerance to meet the stopping condition is off(Ak) < 10~ 10 . We can 
see from Fig. 6.37 that the number of iterations for a QR algorithm to converge is in 
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Figure 6.37: The number of iterations for a QR algorithm to converge versus the 
matrix size. 


the order of 10 for matrix size smaller than 20 x 20. Even though we can reduce the 
matrix to Hessenberg form for full dense matrix or tridiagonal form for symmetric 
matrix, and the QR iteration with origin shift can accelerated the convergent rate 
[32, 91, 98], the number of iteration is still in the order of 10. As an example, the 
4x4 tridiagonal matrix 

12 0 0 
2 3 4 0 
0 4 5 6 
0 0 6 7 

still requires eight iterations to converge when the most elegant symmetric QR al- 
gorithm, the Symmetrix QR Algorithm, was used [27, pp.424] This kind of property 
is not desirable for parallel processing implementation. It is known that Jacobi-like 
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method may require more flops as compared to the symmetric QR algorithm. How- 
ever, due to parallel implementation, many rotations may take place at the same 
time. The computations involved in QR algorithm and Jacobi-like method are gen- 
erally of the same complexity. From these discussions, the one which requires less 
number of iterations is more attractive from the parallel implementational point of 
view. Furthermore, the convergence rate of an QR iteration depends on the ratio of 
the eigenvalues. In our simulations, in more than 10% of the cases, the randomly 
generated symmetric matrices required significantly larger number of iterations to 
converge. As pointed out before, it is difficult to monitor the quantity off(Ak) to 
decide when the algorithm converges in the parallel computation. Since the conver- 
gence rate is highly dependent on the ratio of eigenvalues, there is no general rule 
to choose a sufficient number of iteration 5 like Jacobi-like method to insure conver- 
gence. This is also an undesirable intrinsic property of the QR algorithm for parallel 
implementation as compared to that of the Jacobi-like method. 

6.6 Efficient Fault-tolerance Schemes 

Reliable implementation is quite essential in parallel processing architectures. For 
a complex parallel processing system, a single fault from any part of the system 
can make the whole system useless. For various critical applications using spectral 
decomposition, highly-reliable computations are demanded. Fault-tolerance is there- 
fore needed in many of these problems. A simple and cost effective fault- tolerant 
scheme is the checksum and weighted checksum proposed by Abraham et al [2]. This 
scheme is one of the typical examples of the algorithm-based fault-tolerance which has 
been applied to various signal processing and linear algebra operations [65]. Define 
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the checksum vector e T = [1, 1, ■ • • , 1], the column, row and full checksum matrices 
A c , A t and A] of a square n-by-n matrix A are defined as 

A 

A c = 

e T A 

A, — A Ae ■> 

- 

A Ae 

A f = 

e r A ej Ae 

If there is any fault occurring during the computation, the checksum criterion is not 
met and thus the fault is detected. The weighted checksum scheme can be further 
used to correct errors [47]. It has been suggested in [17] that checksum/ weighted 
checksum scheme can be incorporated into the QR iteration for error detection. These 
properties as well as others are considered here for the multi-phase arrays. 

Since there are different operations in different phases, the inherent natures of 
the operations of each phase is thus different and should be examined for possible 
fault-tolerant implementation individually. The fault-tolerant schemes for each phase 
of the multi-phase triangular array are given as follows: 

• Phase 1: As pointed out in [17, 65], row checksum is invariant for the QR 
decomposition. It can be seen 

= A Ae = Q R Re = QRr- (6.229) 

This means that the QR decomposition of a row checksum matrix results in a 
row checksum upper triangulat matrix. Fig. 6.38 shows the implementation of 
this scheme. 
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Figure 6.38: Row checksum for A r = QR r . 
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Figure 6.39: Column checksum for R T A? = . 


144 



• Phase 2: Due to the nature of computations in this phase, row checksum is no 
longer valid. Fortunately, column checksum is possible as given by 

R- t A t c = R~ t [ a t A T e ] = [ Q T Q T e ] = Ql (6.230) 

An implementation of this scheme is shown in Fig. 6.39. 

• Phase 3: Although a row checksum upper triangular matrix R r and a column 
checksum unitary matrix Q c are obtained in the above phases, unfortunately, 
R r Q c does not yield any relevant use. By defining the trace operation as the 
sum of the diagonal elements in a square matrix, we obtain Tr[AB) = Tr{BA\, 
where A and B are square matrices. Therefore, 

Tr[A k+1 ] = Tr[R k Q k ) = Tr[Q k R k ) = Tr[A k ) =£\ t , (6.231) 

i=i 

where A; is the eigenvalue of matrix A k . This invariant property can be used 
to check the result of the Phase 3 operation. Once the trace of A k is different 
from the trace obtained before, a fault is then detected during the Phase 3 
computation. 

For the rectangular array, the Phase 2 operation is the same as the Phase 3 
operation of the triangular array. For its Phase 1 operation, an interesting feature of 
this computation is given by 

Q\A,HI,} = [RrllQA- (6.232) 

That is, a row checksum of the parallel combination of matrices A and / gives a 
row checksum of the upper triangular matrix R r and a row checksum of the unitary 
matrix Q r . 
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Chapter 7 


Conclusions and Future Research 


This dissertation has addressed various aspects of the RLS problems and spectral 
decomposition algorithms based on QRD approach. 

In Chapter 2, we have shown that the Householder transformation can be im- 
plemented on a systolic array. By using a two-level pipelined implementation, the 
throughput of the SBHT RLS systolic array can be made as fast as that of the origi- 
nal Givens array in [78]. While the system latency is longer for the SBHT, it provides 
a better numerical stability than the Givens method. Clearly, the Givens array is a 
special case of the SBHT array with a block size of one. In general, the block size 
is an important variable. A larger block size results in a better numerical stability, 
while the system latency is increased. Many known properties of the Givens array 
are also applicable to the SBHT array. For example, the real-time algorithm-based 
fault-tolerant scheme proposed in [65] can also be easily incorporated into the SBHT 
RLS array. From the results described in Chapter 2, it appears that the Householder 
transformation method is useful in real-time high throughput applications of modern 
signal processing as well as in VLSI implementation. 
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In Chapter 3, special inherent natures of a QRD recursive LS systolic array are 
used to design a real-time, low-cost algorithm-based fault-tolerant system. By proper 
design of the artificial desired response, the residual output serves as a concurrent er- 
ror detector. The disadvantages of the CSE scheme to detect fault are thus prevented. 
The proposed method requires the same complexity as that of the CSE scheme in 
[17] without hindering the throughput rate of the system. Two methods, FFL and 
CSE, are then introduced to locate the faulty row. For both methods, the recovery 
latency is achieved in O(p) time. However, the recovery latency of the CSE method is 
generally less than that of the FFL method in that parallel execution is possible and 
multiple ports can be accessed. Once the faulty row is determined, an order-degraded 
reconfiguration is performed so that the system is operating in an gracefully degraded 
manner. Any single fault occurring in the system will not cause a catastrophic degra- 
dation of the system and thus one critical requirement for many high performance 
and demanding VLSI signal processing tasks is met. 

This fault-tolerance system is general in the sense that it is independent of the 
implementation algorithms. It is applicable to Given rotation method as well as 
those methods such as Modified Gram-Schmidt, square- root free Given methods [19], 
and Householder transformations [66] etc. The LS problems with constraints such 
as MVDR beamforming is also applicable [80]. The residual method for concurrent 
error detection is robust in the sense that the probability of error detection given by a 
fault occurred equals one (for infinite-precision implementation). This scheme could 
have great impact on next generation of radar systems which may use VLSI array 
processors as their central signal processing units. 

In Chapter 4, We present an important observation that rotation parameters of the 
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QRD LS algorithm will eventually reach the quasi steady-state. With this observation, 
the dynamic range of each processing cell is then derived to insure correct operation 
of the algorithm. Also, we can prove the stability of the QRD LS algorithm under 
finite-precision implementation with this observation. The missing error detection 
and false alarm problems are considered. We present a design of the memory size 
which is overflow free without missing error detection and false alarm problems. 

In Chapter 5, the order degraded performance is derived and a geometric inter- 
pretation is given. A scheme to estimate the optimal residual under faulty situation 
is also presented. 

The multi-phase systolic algorithms proposed in Chapter 6 can be used efficiently 
to solve the eigenvalue and SVD problems based on the QR algorithm. In particu- 
lar, the eigenvectors can be obtained without global communication within the array 
using the multi-phase operations. We showed that the QR algorithm can achieve a 
parallel implementation on a single architecture. Two systolic arrays, a triangular 
and a rectangular, are proposed for multi-phase implementation. Efficient algorithm- 
based fault-tolerance schemes can be incorporated with both arrays easily. Since the 
operations in each phase belong to the same types of computations, the cell com- 
plexity is thus not increased by multi-phase operations. There is a tradeoff between 
numerical stability and complexity for both arrays. Each iteration takes 0(n) time 
unit while the time required for convergence is O(S'n), where S is the number of 
iterations. Unlike the Jacobi-like method, the convergence rate of the QR algorithm 
depends on the ratio of the eigenvalues. As a consequence, S may vary differently for 
the matrix of the same size, with or without origin shift to accelerate the convergence. 
Generally, S is in the order of 10 for the QR algorithm. While we have demonstrated 
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the advantage of the QR algorithm that can yield two multi-phase systolic algorithms 
implementable on single architectures without requiring global connections, the in- 
trinsic convergence rate for the QR algorithm makes it less attractive as compared 
to the Jacobi-like method in parallel implementation. Depending on specific system 
and hardware requirements, one approach may be more desirable than the other. 
Of course, it is most meaningful to have two basic approaches to choose from for 
real-time VLSI signal processing based on spectral decomposition. 

There are various continuing problems extended from this dissertation that can 
be done in the future. For the SBHT RLS systolic algorithm, to pursue a possible 
CORDIC implementation would be an interesting problem. An algorithm-based fault- 
tolerance has been proposed for the QRD RLS systolic algorithm. It would be a valid 
question to ask if there is any efficient fault-tolerant scheme for the transversal or 
lattice algorithms. Different from the residual estimation under faulty situration, it 
would be possible for us to ask if we can estimate the full-order optimal residual under 
the order degraded operation. Also, there are still some questions on re-convergence 
of a fault-tolerant QRD RLS systolic array. When there is a fault occurred in the 
system, if it is a transient fault, how long will this fault affect the system? In another 
words, how soon will the adaptive system recover itself from the faulty situation and 
converge. There are two questions. The first one is how long will it recover from the 
short time transient fault without any cure? Does the position of the faulty processor 
play an important role? The next question is when we switch to order degraded 
operation, do we want to reinitialize the system or to restart from the contaminated 
contents? 

Future research in the VLSI signal processing can be divided into the following 
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categories: 


1. Algorithms and architectures aspects 

To reconsider and to develop new parallel and pipelined algorithms, to de- 
sign dedicated application-specific architectures for parallel algorithms (either 
in VLSI or DSP chip sets), and to find efficient fault- tolerant schemes for special- 
purposed applications are the major issues in this category. 

2. Software Supported CAD Aspects 

To provide simulation tools for algorithms and architectures development and 
to support automatic design of complicated systems are the major concerns of 
this category. 

3. Application and Practical Implementation Aspects 

There is a large application spectrum. To name a few, the areas such as 
sonar/radar, communication, image processing, HDTV, coding, video, speech 
processing, and music are of potential interest to researcher in the VLSI signal 
processing. 
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